Viable prefixes play an important role in LR parsing theory. In the work presented
here,  viable  prefixes  have  a commensurately  central  role  in  a theory  of  general  context-free 
recognition  and  parsing. 

A set-theoretic  framework  for  describing  general  context-free  recognition  is  presented. 
The  operators  and  operands  in  the  framework  are  regularity-preserving  relations  and  regular 
sets  of  viable  prefixes,  respectively.  A basic  operation  consists  of  computing  the  image  of  a 
regular  set  of  viable  prefixes  under  one  of  the  relations.  By  extension,  general  recognition  is 
characterized  in  terms  of  computing  a sequence  of  regular  sets. 

For  implementation  purposes,  finite-state  automata  are  used  to  represent  the  regular 
sets.  A general  bottom-up  recognizer  that  constructs  an  appropriate  sequence  of  automata  is 
described  in  detail.  The  regular  languages  accepted  by  these  automata  correspond  to  the 
sets  of  viable  prefixes  computed  by  the  recognizer’s  set-theoretic  counterpart.  The  automata 
are  constructed  under  the  guidance  of  a control  automaton  which  accepts  the  viable  prefixes 
of  the  subject  grammar.  Ultimately,  the  automata-based  recognizer  is  extended  to  a truly 
general  bottom-up  parser. 

Earley’s  algorithm  is  analyzed  in  the  context  of  our  viable  prefix-based  framework  as  it 
provides a convenient vehicle for illustrating some of our ideas. We describe how Earley’s
algorithm implicitly tracks the sets of viable prefixes that arise in our model. Moreover, by
modifying  Earley’s  recognizer  to  construct  a certain  directed  graph,  the  representation  of 
these  sets  is  made  explicit. 

Our  set-theoretic  framework  yields  elegant  and  succinct  characterizations  of  general 
context-free  recognition  that  appear  to  capture  the  essence  of  the  task.  On  the  practical 
front,  a general  bottom-up  parser  is  described  in  sufficient  detail  to  be  readily  implemented. 
Although its practical potential is not evaluated here, the parser is intended for use in problem
areas that require more flexible parsers than are provided within the efficient but restricted
LR framework. Regardless, our viable prefix-based treatment of recognition and
parsing  provides  a particularly  appropriate  framework  within  which  the  continuum  between 
LR  parsers  and  our  general  parsers  may  be  further  investigated. 

CHAPTER I

INTRODUCTION


Context-free  recognition  is  the  algorithmic  process  by  which  the  membership  of  a string 
x within  a context-free  language  L is  decided.  This  involves  determining  whether  x is 
derived by some context-free grammar G where L = L(G). Parsing is the process of ascertaining
the syntactic structure imparted to x by G.

From  a theoretical  standpoint,  context-free  recognition  and  parsing  hold  considerable 
interest  in  their  own  right.  Yet  context-free  grammars  and  their  recognizers  and  parsers 
have  substantial  practical  value  as  well.  Most  notably,  results  from  parsing  theory  have 
proven indispensable to the implementation of programming languages. Other areas of application
include natural language processing [34], syntactic pattern recognition [18], and code
generation in compilers [10].

Given  an  arbitrary  grammar  G and  an  arbitrary  string  x over  the  terminal  alphabet  of 
G,  a general  recognizer  (resp.  parser)  recognizes  (resp.  parses)  x with  respect  to  G.  The 
work  presented  here  contributes  to  the  area  of  general  context-free  recognition  and  parsing. 
The  following  section  provides  some  motivation  and  a brief  overview  of  this  dissertation. 

Overview 

The  LR  parsers,  namely  those  parsers  that  effect  a Left-to-right  scan  of  the  input  while 
producing  a Right  parse,  define  the  most  powerful  class  of  deterministic  parsers.  Earley’s 
algorithm,  on  the  other  hand,  is  arguably  the  most  efficient  general  parser.  Despite  the  fact 
that LR parsers are restricted to LR(k) grammars whereas Earley’s algorithm can parse
strings  against  any  context-free  grammar,  there  are  close  parallels  between  the  two. 

Both LR parsers and Earley’s algorithm are based on items. Each state of an LR
parser  corresponds  to  a set  of  LR  items.  Earley’s  algorithm  constructs  a sequence  of  state 
sets  during  recognition.  The  states  manipulated  by  Earley’s  algorithm  — call  them  Earley 
states  — are  slightly  elaborated  LR  items. 

Earley’s algorithm and LR parsers scan the input string from left to right, recognizing
an incrementally-longer prefix of it in the process. That is, they are correct-prefix recognizers.

Both LR parsers and Earley’s algorithm work in a bottom-up fashion. An LR parser
determines the reversed rightmost derivation of an input string. In contrast, Earley’s algorithm
has the capability of producing all of the reversed rightmost derivations of an input
string.

The  relationship  between  Earley’s  algorithm  and  LR  parsers  can  be  described  on  a 
more  fundamental  level  in  terms  of  viable  prefixes.  Viable  prefixes  are  certain  prefixes  of 
right sentential forms. At each point during a parse, the contents of an LR parser’s stack
implicitly represent a viable prefix which derives the portion of the input string parsed to
that point. We let VP(G) denote the set of viable prefixes of a grammar G. In addition, let
VP(G,x) denote the set of those viable prefixes of G which derive x, a string over the terminal
alphabet of G.

Turning  now  to  Earley’s  algorithm,  consider  a point  in  a parse  at  which  some  prefix  x 
of  the  input  string  has  been  processed.  The  sequence  of  Earley  state  sets  constructed  up  to 
that  point  encapsulates  the  strings  in  VP(G,x).  The  manner  in  which  VP(G,x)  is  normally 
represented  in  the  state  sets  is  rather  indirect.  However,  this  representation  can  be  made 
explicit through a variant of Earley’s algorithm which constructs a directed graph whose vertices
are the states generated by the original algorithm. Under an appropriate interpretation,
this graph yields a finite-state automaton which accepts VP(G,x). Details of this proposed
graphical variant of Earley’s algorithm are supplied later.

Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of
G, VP(G,x) is a regular language. This fact can be established analytically. Alternatively,
the  graphical  variant  of  Earley’s  algorithm  mentioned  above  provides  a constructive  proof  of 
this  result. 

In light of these observations, the primary thrust of this work is the formal development
of an approach to general context-free recognition and parsing that is based on explicitly
computing VP(G,x) for an incrementally-longer prefix x of the input string. In particular,
the viable prefix is the central concept upon which useful general recognizers and parsers
are founded. The development is rigorous, yet we strive for clarity and elegance by resorting
to basic principles wherever possible. In short, our approach to general recognition and parsing
generalizes the role played by viable prefixes in LR parsers in order to accommodate
arbitrary grammars.

This  work  consists  of  three  logical  divisions.  In  the  first  (Chapters  III  and  IV),  the 
mathematical  foundation  for  our  viable  prefix-based  approach  to  recognition  and  parsing  is 
developed.  The  basic  tools  are  a handful  of  binary  relations  on  strings.  General  recognition 
is  described  using  these  relations  and  simple  set-theoretic  concepts.  A key  property  of  the 
relations  is  that  they  preserve  regularity.  Consequently,  general  top-down  and  bottom-up 
recognition  schemes  are  defined  in  terms  of  computing  the  images  of  regular  sets  of  viable 
prefixes  under  these  relations.  In  short,  general  recognition  is  reduced  to  computing  a 
sequence  of  regular  sets. 

In  the  second  major  division  (Chapter  V),  Earley’s  algorithm  is  used  as  a vehicle  for 
demonstrating the efficacy of our set-theoretic approach to general recognition. In particular,
the graph-based variant of Earley’s algorithm is presented there. This modified algorithm
illustrates one way in which VP(G,x) can be explicitly computed where x is a prefix
of the input string. In the process of analyzing our Earley derivative, some subtle properties
of Earley’s original algorithm are also revealed and its relationship with LR parsers is
clarified.

The last part of this work (Chapters VI and VII) casts our approach to general recognition
and parsing into an automata-theoretic framework. First, a general recognizer is
described  in  considerable  detail.  The  recognizer  uses  an  automaton  which  accepts  VP(G)  to 
guide the construction of an automaton that accepts VP(G,x), where x is some prefix of the
input string. For convenience, the description of the algorithm employs the LR(0) automaton
of G as the guiding automaton. However, the algorithm allows for a rather broad range
of VP(G)-accepting automata to be used instead. For example, employing the nondeterministic
LR(0) automaton of G as a controlling automaton yields a general recognizer which
works  quite  similarly  to  our  graph-based  Earley  algorithm.  Finally,  this  automata-based 
recognizer  is  extended  to  a general  parser.  Means  for  representing  parse  forests  and  handling 
ambiguity  are  described.  The  recognizer  and  parser  are  presented  in  enough  detail  to  be 
readily  implemented.  In  anticipation  of  this,  many  practical  issues  are  discussed. 

Literature  Review 

A comprehensive introduction to formal languages and automata is presented by Hopcroft
and Ullman [24]. These two related disciplines are prerequisites to a study of context-free
recognition and parsing. An up-to-date monograph on parsing theory has been written
by  Sippu  and  Soisalon-Soininen  [39].  Two  volumes  by  Aho  and  Ullman  [6,7]  contain  a wealth 
of  information;  numerous  parsing  algorithms  are  presented,  both  general  and  restricted, 
along  with  much  of  the  theory  underlying  them. 

Some  early  general  parsing  algorithms  are  compared  by  Griffiths  and  Petrick  [22].  All 
of the algorithms surveyed rely on backtracking, so they run in $O(c^n)$ time in the worst case
($n$ is the length of the input string).

Although it is restricted to Chomsky Normal Form grammars, the Cocke-Younger-Kasami
algorithm [6,19,46] is regarded as the first general parser to run in polynomial time
($O(n^3)$). The $n \times n$ parse matrix that the algorithm constructs accounts for an $O(n^2)$ space
complexity. Recall that the matrix entries are filled with sets of nonterminal symbols.

A version of the Cocke-Younger-Kasami algorithm that is restricted to unambiguous
grammars is presented by Kasami and Torii [25]. The time and space bounds of this algorithm
are both $O(n^2 \log n)$. Another version which employs linked lists in place of the parse
matrix  is  described  by  Manacher  [32].  This  alternate  storage  discipline  allows  unambiguous 
grammars  to  be  recognized  in  quadratic  time,  a marked  improvement  over  the  corresponding 
cubic  bound  of  the  original  algorithm. 

The  Cocke-Younger-Kasami  algorithm  was  reduced  to  matrix  multiplication  by  Valiant 
[44]. Using this result, Strassen’s technique for multiplying matrices [1] is applied to obtain
an asymptotic worst-case time complexity of $O(n^{2.81})$ for general recognition.¹ Due to the
overhead associated with this method, it is primarily of theoretical interest.

In  contrast  to  the  Cocke-Younger-Kasami  algorithm,  Earley’s  algorithm  [6,13,14]  can 
process  any  grammar.  Like  LR  parsers,  Earley’s  algorithm  is  based  on  sets  of  items. 
Although its worst-case time and space bounds are also $O(n^3)$ and $O(n^2)$, respectively, it
performs significantly better on large classes of grammars. Specifically, unambiguous grammars
are parsed in $O(n^2)$ time, and only $O(n)$ time is needed to parse LR(k) grammars provided
that k-symbol lookahead is used in the latter case. Earley’s algorithm is examined
further  in  later  chapters. 

Efficiency improvements that may be gained by employing LL- and LR-like lookahead²
in  Earley’s  algorithm  are  reported  by  Bouckaert  et  al.  [9].  They  concluded  that  FIRST  sets 
are  more  useful  than  FOLLOW  sets  for  reducing  the  number  of  superfluous  items  generated 
during  recognition.  In  short,  FIRST  (resp.  FOLLOW)  information  reduces  the  number  of 
items  generated  by  Earley’s  Predictor  (resp.  Completer)  operation.  See  Christopher  et  al. 
[10]  for  an  example  of  an  application  of  Earley’s  algorithm;  specifically,  it  is  used  to  generate 
optimized code in a Graham-Glanville style code generator [17]. If desired, Earley’s algorithm
may be extended to include error recovery [3,31].


¹ Even faster techniques for matrix multiplication have been developed since.

² That is, FIRST and FOLLOW sets, respectively.

An algorithm that is a hybrid of the Cocke-Younger-Kasami and Earley algorithms is
described  by  Graham  et  al.  [19,20].  This  algorithm  also  accommodates  arbitrary  grammars. 
Like the Cocke-Younger-Kasami algorithm, an $n \times n$ parse matrix is constructed. However,
the matrix positions are filled with sets of LR items instead of sets of nonterminals. Practical
issues are discussed in detail and claims are made that more efficient implementations are
attainable than are allowed by Earley’s algorithm. Sub-cubic versions based on matrix multiplication
techniques are also described.

The class of LR(k) grammars was introduced by Knuth in the seminal paper on LR
parsing theory [27]. Knuth described a method for constructing a deterministic parser for an
LR(k) grammar, observed that the set of viable prefixes of an arbitrary grammar is a regular
language, and proved that it is undecidable whether an arbitrary grammar is LR(k) for free
$k \geq 0$. The discovery of LR(k) grammars was quite significant in light of their relationship to
deterministic context-free languages [16].

Knuth’s  technique  for  parser  construction  is  generally  deemed  impractical  due  to  the 
enormous number of parse states that can result. The SLR(k) [12] and LALR(k) [11,29]
grammars define two important subclasses of the LR(k) grammars which allow this problem
to be addressed satisfactorily. Relatively compact LR parsers for grammars in these subclasses
can be constructed efficiently.

Tomita’s  algorithm  [42,43]  extends  the  conventional  LR  parsing  algorithm  to  use  parse 
tables  that  contain  multiply-defined  entries.  Conflicting  parse  actions  are  handled  by 
employing  a graph-structured  stack  to  keep  track  of  the  different  parse  histories.  However, 
some grammars cause the stack to grow without bound in instances where no input is consumed,
so the algorithm is not general. Tomita’s algorithm is discussed in greater detail
later.

The application of Tomita’s algorithm to a system which supports the incremental generation
of parsers is reported by Heering et al. [23]. Specifically, Tomita’s algorithm is
adapted to work with an incrementally generated LR(0) automaton. The states of the automaton
are created based on need. Moreover, the system accommodates extensible grammars
whereby  changes  in  the  grammar  during  parsing  produce  corresponding  changes  in  the 
relevant  portions  of  the  automaton. 

Work  which  is  similar  in  spirit  to  ours  is  that  of  Mayer  [33];  deterministic  canonical 
bottom-up  parsing  is  examined  in  terms  of  reduction  classes  where  a reduction  class  is  a pair 
of strings, the first and second components of which represent the left- and right-contexts,
respectively,  of  parsing  actions.  Conditions  are  imposed  on  these  reduction  classes  which 
ensure  determinism,  termination,  and  correctness.  In  short,  the  cited  paper  presents  a 
framework  for  describing  deterministic  canonical  bottom-up  parsers,  whereas  our  aim  is  a 
framework  for  characterizing  general  recognition  and  parsing. 

Outline  in  Brief 

This  introductory  chapter  ends  with  a very  short  synopsis  of  the  remaining  chapters. 
The  next  chapter  reviews  some  basic  definitions  and  terminology.  Chapters  III  through  VII 
comprise the main body of this dissertation. Concluding remarks are made in Chapter VIII.

Chapters  III  and  IV  develop  the  mathematical  foundation  for  this  work.  Set-theoretic 
characterizations  of  general  top-down  recognition  and  general  bottom-up  recognition  are 
presented  in  those  two  chapters. 

Earley’s algorithm is the subject of the fifth chapter. In particular, our graphical variant
of Earley’s algorithm is presented there.

A general  automata-based  bottom-up  recognizer  is  described  in  detail  in  Chapter  VI. 
Chapter  VII  extends  this  recognizer  into  a general  parser. 

The  major  results  of  this  dissertation  are  summarized  in  Chapter  VIII.  In  addition, 
directions  for  future  research,  of  which  there  are  several,  are  delineated  in  that  final  chapter. 


CHAPTER  II 

NOTATION  AND  TERMINOLOGY 


This  chapter  summarizes  some  of  the  elementary  formal  aspects  of  this  work,  viz., 
assorted  mathematical  notation  and  definitions.  In  particular,  some  basic  concepts  of  formal 
languages,  directed  graphs,  and  finite-state  automata  are  reviewed.  A more  comprehensive 
presentation of the relevant theory can be found in the monograph by Sippu and Soisalon-Soininen
[39].

Elements  of  Formal  Language  Theory 

An alphabet, denoted in this section by $\Sigma$, is a finite set of symbols. A string over $\Sigma$ is
a finite sequence of elements from $\Sigma$; the null string corresponds to the empty sequence and
is denoted by $\epsilon$. A (formal) language over $\Sigma$ is a set of strings over $\Sigma$; the set of all strings
over $\Sigma$ is denoted by $\Sigma^*$ and $\Sigma^+ = \Sigma^* \setminus \{\epsilon\}$.

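The length of a string is the number of symbols that it contains. The length of a string
$x \in \Sigma^*$ is denoted by $\mathrm{len}(x)$ where len is defined recursively as follows: $\mathrm{len}(\epsilon) = 0$; $\forall a \in \Sigma$,
$\mathrm{len}(a) = 1$; $\forall x, y \in \Sigma^*$, $\mathrm{len}(xy) = \mathrm{len}(x) + \mathrm{len}(y)$.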

The previous definition used the notion of string concatenation, viz., $xy$. Concatenation
is generalized to apply to languages as follows. Given two languages $L$ and $L'$ and a
string $x$, $LL' = \{ yz \mid y \in L,\ z \in L' \}$, $xL = \{x\}L$, and $Lx = L\{x\}$. The identity and zero of
concatenation are $\epsilon$ and $\emptyset$ (the empty set), respectively. Thus, with $x$ denoting either a string
or a language, $x\epsilon = \epsilon x = x$ and $x\emptyset = \emptyset x = \emptyset$.

Let $L$ be a language and $i$ a natural number. The $i$th power of $L$, $L^i$, is defined recursively
by $L^0 = \{\epsilon\}$ and $L^{i+1} = LL^i$. The positive closure of $L$ and the Kleene closure of $L$ are
defined by $L^+ = \bigcup_{i > 0} L^i$ and $L^* = \bigcup_{i \geq 0} L^i = L^+ \cup \{\epsilon\}$, respectively.
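
As an illustrative sketch only, the language operations defined above can be mimicked for finite languages using Python sets of strings. The function names (concat, power, closure_up_to) are hypothetical, the empty Python string stands in for the null string, and because the closures are infinite in general, only a bounded union of powers is computed.

```python
def concat(L1, L2):
    """Concatenation of two finite languages: { yz | y in L1, z in L2 }."""
    return {y + z for y in L1 for z in L2}


def power(L, i):
    """i-th power of L, following L^0 = {""} and L^(i+1) = L L^i."""
    result = {""}  # "" plays the role of the null string
    for _ in range(i):
        result = concat(L, result)
    return result


def closure_up_to(L, bound, positive=False):
    """Union of L^i for i <= bound; a finite approximation of L* (or L+)."""
    first = 1 if positive else 0
    out = set()
    for i in range(first, bound + 1):
        out |= power(L, i)
    return out


L = {"a", "b"}
print(sorted(power(L, 2)))
print(sorted(closure_up_to({"a"}, 2)))
```

Representing the empty language by `set()` and the null string by `""` makes the identity and zero laws of concatenation directly checkable on small examples.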

Let $x$, $y$, and $z$ be arbitrary strings over $\Sigma$ and let $w = xyz$. Then $x$ is a prefix of $w$, $y$
is a substring of $w$, and $z$ is a suffix of $w$. If $0 < \mathrm{len}(x) < \mathrm{len}(w)$ holds, then $x$ is a proper
prefix of $w$; similarly, if $0 < \mathrm{len}(z) < \mathrm{len}(w)$ holds, then $z$ is a proper suffix of $w$. We define
$\mathrm{PREFIX}(x) = \{ y \in \Sigma^* \mid x = yz \text{ for some } z \in \Sigma^* \}$ and
$\mathrm{SUFFIX}(x) = \{ z \in \Sigma^* \mid x = yz \text{ for some } y \in \Sigma^* \}$.
If $k$ is a natural number, then $k{:}x$ (resp. $x{:}k$) denotes the unique prefix (resp. suffix)
of $x$ of length $\min\{\mathrm{len}(x), k\}$. This notation is extended to languages as follows. For $L \subseteq \Sigma^*$,
$\mathrm{PREFIX}(L) = \bigcup_{x \in L} \mathrm{PREFIX}(x)$, $\mathrm{SUFFIX}(L) = \bigcup_{x \in L} \mathrm{SUFFIX}(x)$,
$k{:}L = \{ k{:}x \mid x \in L \}$, and $L{:}k = \{ x{:}k \mid x \in L \}$.
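
The prefix and suffix operators can be sketched for ordinary Python strings as follows. This is a minimal illustration under the assumption that each symbol is a single character; the names prefixes, suffixes, k_prefix and k_suffix are hypothetical and do not appear in the text.

```python
def prefixes(x):
    """PREFIX(x): every y such that x = yz for some z."""
    return {x[:i] for i in range(len(x) + 1)}


def suffixes(x):
    """SUFFIX(x): every z such that x = yz for some y."""
    return {x[i:] for i in range(len(x) + 1)}


def k_prefix(k, x):
    """k:x, the prefix of x of length min(len(x), k)."""
    return x[:k]


def k_suffix(x, k):
    """x:k, the suffix of x of length min(len(x), k)."""
    return x[max(len(x) - k, 0):]


def prefixes_of_language(L):
    """PREFIX(L): the union of PREFIX(x) over all x in the finite language L."""
    out = set()
    for x in L:
        out |= prefixes(x)
    return out


print(sorted(prefixes("abc")))
print(k_prefix(2, "abc"), k_suffix("abc", 2))
```

The `max(..., 0)` in `k_suffix` guards the case where `k` exceeds the length of `x`, in which case the whole string is returned.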

The reversal of a string $x \in \Sigma^*$, denoted by $x^R$, is defined recursively as follows: $\epsilon^R = \epsilon$;
$\forall a \in \Sigma$, $a^R = a$; $\forall x, y \in \Sigma^*$, $(xy)^R = y^R x^R$. Similarly, the reversal of a language $L$ is
defined by $L^R = \{ x^R \mid x \in L \}$.

Context-Free  Grammars  and  Languages 

A (context-free) grammar is denoted by $G = (V, T, P, S)$ where $V$ is an alphabet
known as the vocabulary of $G$, $T \subseteq V$ and $N = V \setminus T$ are the terminal and nonterminal
alphabets, respectively, $P \subseteq N \times V^*$ is the finite set of productions, and $S \in N$ is the start
symbol. The following conventions are generally adhered to: $a, b, c, t \in T$; $w, x, y, z \in T^*$;
$A, B, C, S \in N$; $X, Y, Z \in V$. In addition, lower-case Greek letters denote strings in $V^*$. An
arbitrary grammar $G$ is assumed throughout the rest of this section.

A production $(A, \omega) \in P$ is written $A \to \omega$; $A$ and $\omega$ are the left-hand side and right-hand
side of the production, respectively. A group of productions that share the same left-hand
side, viz., $A \to \omega_1$, $A \to \omega_2$, ..., $A \to \omega_n$, $n \geq 1$, may be abbreviated as
$A \to \omega_1 \mid \omega_2 \mid \cdots \mid \omega_n$. A production with a right-hand side of $\epsilon$ is called a null production
or $\epsilon$-production.

It  is  common  to  specify  a grammar  by  listing  only  its  productions.  In  this  case,  the 
left-hand  side  of  the  first  production  or  production  group  in  the  list  is  taken  to  be  the  start 
symbol.  The  nonterminal  and  terminal  alphabets  can  be  inferred  from  the  productions. 


If $A \to \omega$ is a production in $P$, then $A \to \alpha \bullet \beta$ is an item of $G$ for each $\alpha$ and $\beta$ such
that $\omega = \alpha\beta$. The size of $G$ is defined as $|G| = \sum_{A \to \omega \in P} \mathrm{len}(A\omega)$. Note that the size
of $G$ is equal to $|\{ A \to \alpha \bullet \beta \mid A \to \alpha \bullet \beta \text{ is an item of } G \}|$. The reversal of $G$ is the
grammar $G^R = (V, T, P^R, S)$ where $P^R = \{ A \to \omega^R \mid A \to \omega \in P \}$.
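
The following sketch illustrates items, grammar size, and grammar reversal on a small example. The tuple-based representation of productions and the names GRAMMAR, items, size and reversal are assumptions made for illustration; they are not notation from the text.

```python
# A production A -> X1 ... Xn is modelled as (A, (X1, ..., Xn)).
# This representation is an assumption of the sketch.
GRAMMAR = [
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("a",)),
]


def items(productions):
    """Every item (A, alpha, beta) such that A -> alpha beta is a production."""
    return [(lhs, rhs[:dot], rhs[dot:])
            for lhs, rhs in productions
            for dot in range(len(rhs) + 1)]


def size(productions):
    """|G|: the sum of len(A omega) over all productions A -> omega."""
    return sum(1 + len(rhs) for _, rhs in productions)


def reversal(productions):
    """Productions of the reversed grammar: each right-hand side is reversed."""
    return [(lhs, tuple(reversed(rhs))) for lhs, rhs in productions]


print(size(GRAMMAR), len(items(GRAMMAR)))
```

A production with a right-hand side of length n contributes n + 1 dot positions, which is why the number of items coincides with the size of the grammar (assuming the productions are distinct).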

The derives relation ($\Rightarrow$), a binary relation induced on $V^*$ by $P$, is defined formally by
$\Rightarrow\ = \{ (\alpha A \beta, \alpha\omega\beta) \mid \alpha, \beta \in V^*,\ A \to \omega \in P \}$. A string $\gamma \in V^*$ such that $S \Rightarrow^* \gamma$ holds¹ in $G$ is
called a sentential form of $G$; the set of the sentential forms of $G$ is denoted by $\mathrm{SF}(G)$. The
(context-free) language that is generated by $G$ is defined by $L(G) = \mathrm{SF}(G) \cap T^*$. Each
member of $L(G)$ is called a sentence of $G$. We use $\mathrm{PREFIX}(G)$ and $\mathrm{SUFFIX}(G)$ as abbreviations
for $\mathrm{PREFIX}(L(G))$ and $\mathrm{SUFFIX}(L(G))$, respectively.

For $A \in N$ and $X \in V$, if $A \Rightarrow^+ \alpha X \beta$ holds in $G$ for some $\alpha, \beta \in V^*$, then $X$ is reachable
from $A$. A symbol $X \in V$ is nullable if $X \Rightarrow^* \epsilon$ holds in $G$. A string $\gamma \in V^*$ is nullable if
every symbol in $\gamma$ is nullable. In particular, $\epsilon$ is trivially nullable.
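
One common way to compute the nullable symbols is a fixed-point iteration over the productions, sketched below. The production representation (pairs of a left-hand side and a tuple of symbols) and the function names are hypothetical; the text itself does not prescribe an algorithm.

```python
def nullable_symbols(productions):
    """Return the set of symbols X with X =>* epsilon.

    A nonterminal is added as soon as one of its right-hand sides consists
    solely of symbols already known to be nullable (an empty right-hand
    side qualifies trivially). Terminals are never added.
    """
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(X in nullable for X in rhs):
                nullable.add(lhs)
                changed = True
    return nullable


def is_nullable_string(gamma, nullable):
    """A string is nullable if every symbol in it is nullable."""
    return all(X in nullable for X in gamma)


P = [
    ("S", ("A", "B")),
    ("A", ()),
    ("A", ("a",)),
    ("B", ("A", "A")),
    ("C", ("c",)),
]
print(sorted(nullable_symbols(P)))
```

The loop stops once a full pass adds nothing new, so it performs at most one pass per nonterminal plus a final confirming pass.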

A symbol $X \in V$ is useful if either $X = S$ or $S \Rightarrow^* \alpha X \beta \Rightarrow^* w$ holds in $G$ for some
$\alpha, \beta \in V^*$ and $w \in T^*$; otherwise, $X$ is useless. A grammar is reduced if every symbol in its
vocabulary is useful. An arbitrary grammar $G$ can be transformed into an equivalent
reduced grammar² in $O(|G|)$ time [39]. In light of this result and for the convenience that it
provides, all grammars are assumed to be reduced throughout this work.
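
The useful symbols of a grammar can be found with the familiar two-phase approach sketched here: first determine which symbols derive some terminal string, then determine which of those are reachable from the start symbol. This is a simple fixed-point version intended for illustration; it does not attain the linear time bound cited above, and the function name and production representation are assumptions.

```python
def useful_symbols(productions, terminals, start):
    """Return the useful symbols of a grammar (the start symbol included)."""
    # Phase 1: symbols that derive at least one terminal string.
    productive = set(terminals)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in productive and all(X in productive for X in rhs):
                productive.add(lhs)
                changed = True

    # Phase 2: keep only productions built from productive symbols, then
    # collect everything reachable from the start symbol through them.
    live = [(lhs, rhs) for lhs, rhs in productions
            if lhs in productive and all(X in productive for X in rhs)]
    reachable = {start}
    worklist = [start]
    while worklist:
        A = worklist.pop()
        for lhs, rhs in live:
            if lhs == A:
                for X in rhs:
                    if X not in reachable:
                        reachable.add(X)
                        worklist.append(X)
    return reachable


P = [
    ("S", ("A", "b")),
    ("S", ("B",)),
    ("A", ("a",)),
    ("B", ("B", "c")),   # B never derives a terminal string
    ("C", ("a",)),       # C is not reachable from S
]
print(sorted(useful_symbols(P, {"a", "b", "c"}, "S")))
```

The order of the two phases matters: removing unproductive symbols first ensures that reachability is only established through productions that can actually take part in a derivation of a sentence.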

A grammar $G$ is \$-augmented if, for distinguished symbols $S'$ and \$, $P$ contains a production
of the form $S' \to S\$$ where $S' \in N$ is the (new) start symbol and $\$ \in T$ is a sentence
end-marker. Moreover, $S' \to S\$$ is the only production in which $S'$ and \$ occur. Whenever
we are working with a \$-augmented grammar, all input strings are assumed to end with \$.


¹ The transitive (resp. reflexive-transitive) closure of a binary relation is denoted by a superscript + (resp. *).

² For our purposes, two grammars are equivalent if they generate the same language.

State-Transition Graphs and Finite-State Automata

A state-transition graph (STG) is denoted by $G = (Q, \Sigma, \delta)$ where $Q$ is a finite set of
states, $\Sigma$ is an alphabet, and $\delta \subseteq (Q \times (\Sigma \cup \{\epsilon\})) \times Q$ is the transition relation.³ Thus, an STG
differs from a finite-state automaton only in that it does not have a start state or a set of
final states designated for it. A member $((p, a), q) \in \delta$ is read as a transition from $p$ to $q$ on
$a$; $p$ is the source of the transition and $q$ is the target. A member $((p, a), q) \in \delta$ is also written
as $(p, a, q) \in \delta$ or $q \in \delta(p, a)$; the latter may be written as $q = \delta(p, a)$ if
$(p, a, q), (p, a, r) \in \delta$ implies that $q = r$. A transition on $\epsilon$ is known as an $\epsilon$-transition. An
STG is $\epsilon$-free if it has no $\epsilon$-transitions. For the remainder of this section we assume an arbitrary
STG $G = (Q, \Sigma, \delta)$.

The following property holds for all STGs that arise in this work. If
$(p, a, q), (p, b, r) \in \delta$ and $a \neq b$, then $q \neq r$; in words, distinct transitions which share the
same source state access distinct target states. Thus, for any pair of states $p, q \in Q$, there is
at most one transition from $p$ to $q$.

A path in $G$ and the string over $\Sigma$ that it spells are defined inductively as follows. For
each state $q \in Q$, $(q)$ denotes a path in $G$ from $q$ to $q$ spelling $\epsilon$; for $m \geq 1$ and $q_i \in Q$,
$0 \leq i \leq m$, if $(q_0, q_1, \ldots, q_{m-1})$ denotes a path in $G$ from $q_0$ to $q_{m-1}$ spelling $x \in \Sigma^*$ and
$(q_{m-1}, a, q_m) \in \delta$, then $(q_0, q_1, \ldots, q_m)$ denotes a path in $G$ from $q_0$ to $q_m$ spelling $xa$. The
length of a path is the number of transitions that it contains. A state $q$ is reachable from a
state $p$ if and only if there exists a path in $G$ from $p$ to $q$.

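The succ function, $\mathrm{succ}: Q \times \Sigma^* \to 2^Q$, is defined by
$\mathrm{succ}(p, x) = \{ q \in Q \mid \exists \text{ a path in } G \text{ from } p \text{ to } q \text{ spelling } x \}$.
Extending this function to $R \subseteq Q$, $\mathrm{succ}(R, x) = \bigcup_{q \in R} \mathrm{succ}(q, x)$.

The pred function, $\mathrm{pred}: Q \times \Sigma^* \to 2^Q$, is defined in terms of succ by
$\mathrm{pred}(q, x) = \{ p \in Q \mid q \in \mathrm{succ}(p, x) \}$ and is similarly extended to subsets of $Q$.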


³ A subscript is given to G later to differentiate it from a grammar.

The inverse of $G$ is denoted by $G^{-1} = (Q, \Sigma, \delta^{-1})$ where $(p, a, q) \in \delta^{-1}$ if and only if
$(q, a, p) \in \delta$, i.e., the transitions of $G$ are reversed in $G^{-1}$.

A finite-state automaton (FSA) is denoted by $M = (G, q_0, F) = (Q, \Sigma, \delta, q_0, F)$ where
$G = (Q, \Sigma, \delta)$ is an STG, $q_0 \in Q$ is the start state, and $F \subseteq Q$ is the set of final states. Each
state in $Q$ is assumed to be reachable from $q_0$. If $G$ is $\epsilon$-free, then $M$ is also $\epsilon$-free. If $M$ is
$\epsilon$-free and $(p, a, q), (p, a, r) \in \delta$ implies that $q = r$, then $M$ is deterministic. An arbitrary
(resp. deterministic) FSA is called an NFA (resp. DFA). The (regular) language accepted by
$M$ is defined by $L(M) = \{ x \in \Sigma^* \mid \mathrm{succ}(q_0, x) \cap F \neq \emptyset \}$. A state $q \in Q$ is dead if no final state
is reachable from it.
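
The succ and pred functions, the inverse of an STG, and acceptance by an FSA can be sketched together as follows. This is a minimal illustration, assuming that the transition relation is a Python set of (source, symbol, target) triples, that symbols are single characters, and that the empty string marks an epsilon-transition; all names are hypothetical.

```python
EPS = ""  # stands in for epsilon on transitions (assumption of this sketch)


def eps_closure(delta, states):
    """States reachable from `states` along epsilon-transitions only."""
    seen, stack = set(states), list(states)
    while stack:
        p = stack.pop()
        for (src, a, q) in delta:
            if src == p and a == EPS and q not in seen:
                seen.add(q)
                stack.append(q)
    return seen


def succ(delta, R, x):
    """All states q such that some path from a state in R to q spells x."""
    current = eps_closure(delta, R)
    for a in x:
        step = {q for (p, b, q) in delta if p in current and b == a}
        current = eps_closure(delta, step)
    return current


def inverse(delta):
    """Transition relation of the inverse STG: every transition reversed."""
    return {(q, a, p) for (p, a, q) in delta}


def pred(delta, R, x):
    """All states p from which some state in R is reached by a path spelling x."""
    return succ(inverse(delta), R, x[::-1])


def accepts(delta, q0, F, x):
    """x is in L(M) exactly when succ(q0, x) contains a final state."""
    return bool(succ(delta, {q0}, x) & set(F))


DELTA = {(0, "a", 1), (1, EPS, 2), (2, "b", 3), (0, "a", 3)}
print(sorted(succ(DELTA, {0}, "a")), accepts(DELTA, 0, {3}, "ab"))
```

Computing pred through the inverse relation relies on the observation that a path from p to q spelling x corresponds to a path from q to p in the inverse graph spelling the reversal of x.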


CHAPTER  III 

GENERAL  TOP-DOWN  RECOGNITION:  A FORMAL  FRAMEWORK 

A formal  framework  for  describing  general  top-down  recognition  is  developed  in  this 
chapter. Two contrasting top-down recognition schemes are presented; they are distinguished
by the direction in which the input string is scanned, viz., right-to-left or left-to-right.
Since the two schemes turn out to be mirror images, one is derived in terms of the
other. Our approach to general recognition is based on certain regularity properties of
context-free grammars. Consequently, the framework is designed to highlight
these properties.

The primary purpose of this chapter is to catalog some formal aspects of general top-down
recognition. An investigation of the practical utility of the two general top-down
recognition  schemes  is  left  for  future  work.  However,  the  theoretical  development  contained 
herein  is  invaluable  toward  deriving  a practical,  truly  general,  bottom-up  parser;  that  is  the 
thrust  of  the  remaining  chapters.  An  arbitrary  reduced  grammar  G = (V,  T,  P,  S)  is 
assumed  throughout  this  chapter. 

Recognition  Based  on  Derivations 

In  a top-down  approach  to  recognition,  an  attempt  is  made  to  construct  a parse  tree  for 
an  input  string,  perhaps  implicitly,  by  starting  at  the  root  and  progressing  toward  the  leaves. 
The  downward  growth  of  an  incomplete  parse  tree  occurs  at  the  frontier  of  the  tree  which 
may  be  represented  by  the  string  of  grammar  symbols  which  label  its  nodes.  A basic  step  in 
constructing the parse tree involves applying the $\Rightarrow$ relation to this linearized form of the
frontier. However, the derives relation is too undisciplined, in general, for describing top-down
recognition in a useful fashion since there is no indication of which nonterminal symbol
to replace at each step. Instead, rightmost and leftmost derivations are preferred for the
additional constraints that they place on the parse tree construction process.

Since rightmost and leftmost derivations are defined in terms of subrelations of the $\Rightarrow$
relation, they also construct parse trees top-down. In addition, they impose a canonical¹
order  on  the  construction  of  parse  trees.  Specifically,  rightmost  derivations  construct  parse 
trees  from  right  to  left,  whereas  leftmost  derivations  construct  them  from  left  to  right.  Some 
basic  notions  about  rightmost  and  leftmost  derivations  are  briefly  reviewed  next. 

Rightmost and leftmost derivations are based on the r-derives ($\Rightarrow_r$) and l-derives ($\Rightarrow_l$)
relations, respectively. These relations are formally defined by
$\Rightarrow_r = \{(\alpha A z, \alpha\omega z) \mid \alpha \in V^*,\ A \to \omega \in P,\ z \in T^*\}$ and
$\Rightarrow_l = \{(x A \beta, x\omega\beta) \mid x \in T^*,\ A \to \omega \in P,\ \beta \in V^*\}$.
Rightmost derivations (resp. leftmost derivations) are defined in terms of the reflexive-transitive
closure of $\Rightarrow_r$ (resp. $\Rightarrow_l$) in the usual fashion.
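
The two relations can be sketched in a few lines of Python. This is only an illustration: the toy grammar and the names `GRAMMAR`, `r_derives` and `l_derives` are assumptions introduced here, not part of the text. A sentential form is held as a tuple of symbols, and each function returns every one-step successor of a form.

```python
# Hypothetical toy grammar: S -> A B, A -> a | (empty), B -> b.
# Keys are the nonterminals; every other symbol is treated as a terminal.
GRAMMAR = {"S": [("A", "B")], "A": [("a",), ()], "B": [("b",)]}

def r_derives(grammar, form):
    """Successors of `form` under =>_r: rewrite the rightmost nonterminal."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in grammar:  # everything to the right of i is terminal
            return {form[:i] + rhs + form[i + 1:] for rhs in grammar[form[i]]}
    return set()  # no nonterminal left to rewrite

def l_derives(grammar, form):
    """Successors of `form` under =>_l: rewrite the leftmost nonterminal."""
    for i, sym in enumerate(form):
        if sym in grammar:  # everything to the left of i is terminal
            return {form[:i] + rhs + form[i + 1:] for rhs in grammar[sym]}
    return set()
```

From the form A B, the first function rewrites B and the second rewrites A, which is the only difference between the two canonical orders.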

For $\gamma \in V^*$, if $S \Rightarrow_r^* \gamma$ holds in G, then $\gamma$ is called a right sentential form of G. The set
of the right sentential forms of G is denoted by $SF_r(G)$. The inclusion $SF_r(G) \subseteq SF(G)$
holds and is typically, but not always, proper. In contrast, for $w \in T^*$, $S \Rightarrow^* w$ holds in G if
and only if $S \Rightarrow_r^* w$ holds in G. Thus, $L(G) = \{w \in T^* \mid S \Rightarrow_r^* w \text{ holds in } G\}$.

For $A \in N$ and $X \in V$, if $A \Rightarrow_r^+ \alpha X$ holds in G for some $\alpha \in V^*$, then X is right-reachable
from A; furthermore, if $X = A$, then A is right-recursive. A grammar that has a
right-recursive nonterminal is a right-recursive grammar. A symbol $X \in V$ is nullable in G if
and only if $X \Rightarrow^* \varepsilon$ holds in G.

Any string $\gamma \in V^*$ such that $S \Rightarrow_l^* \gamma$ holds in G is a left sentential form of G. The set of
the left sentential forms of G is denoted by $SF_l(G)$. Similar to the above, the relationship
$SF_l(G) \subseteq SF(G)$ holds and is generally proper. In addition,
$L(G) = \{w \in T^* \mid S \Rightarrow_l^* w \text{ holds in } G\}$.

Given $A \in N$ and $X \in V$, if $A \Rightarrow_l^+ X\beta$ holds in G for some $\beta \in V^*$, then X is left-reachable
from A; if it further holds that $X = A$, then A is left-recursive. A grammar is
left-recursive if at least one of its nonterminals is left-recursive. Finally, $X \in V$ is nullable in
G if and only if $X \Rightarrow_l^* \varepsilon$ holds in G.

1 In the literature, the term "canonical" is typically associated with rightmost derivations only.

Top-Down  Right-to-Left  Recognition 

A general top-down recognition scheme that scans the input string from right to left is
formally developed next.² This scheme is based on two binary relations on $V^*$. Through
these  two  fundamental  relations,  a set-theoretic  characterization  of  general  top-down  right- 
to-left  recognition  which  succinctly  captures  the  essence  of  the  task  is  derived. 

In concert, the two relations refine and supplant the r-derives relation. Certain regularity
properties of context-free grammars that are central to our treatment of recognition are
characterized directly and rather elegantly by the two relations; by comparison, a description
of these properties in terms of r-derives is indirect and somewhat awkward. It is in this
sense that the two relations refine the r-derives relation. Moreover, the two relations provide
alternate definitions of the right sentential forms and sentences of a grammar. In that
respect, the r-derives relation is supplanted by them.

Strong  Rightmost  Derivations 

The strong rightmost derives relation ($\Rightarrow_R$) is defined by
$\Rightarrow_R = \{(\alpha A, \alpha\omega) \mid \alpha \in V^*,\ A \to \omega \in P\}$. Thus, $\Rightarrow_R$ is a subrelation of $\Rightarrow_r$ with
domain $V^*N$. For brevity, the strong rightmost derives relation is called the R-derives relation.
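
A minimal Python sketch of the R-derives relation follows. The grammar and the name `R_derives` are illustrative assumptions; forms are tuples of symbols and the dictionary keys play the role of N.

```python
# Hypothetical grammar: S -> a S b | a b.
GRAMMAR = {"S": [("a", "S", "b"), ("a", "b")]}

def R_derives(grammar, form):
    """Successors of `form` under =>_R.

    The relation is defined only when the LAST symbol of the form is a
    nonterminal; that symbol is replaced by each of its right-hand sides.
    """
    if not form or form[-1] not in grammar:
        return set()
    return {form[:-1] + rhs for rhs in grammar[form[-1]]}
```

The form a S b has a successor under the r-derives relation (its rightmost nonterminal can still be rewritten) but none under R-derives, because its last symbol is a terminal.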

Strong rightmost derivations are defined in terms of the reflexive-transitive closure of
$\Rightarrow_R$. Thus, every strong rightmost derivation is also a rightmost derivation. The following
series  of  lemmas  compares  some  elementary  properties  of  rightmost  and  strong  rightmost 
derivations. 

Lemma 3.1 For $\alpha, \beta \in V^*$, if $\alpha \Rightarrow_R^* \beta$ holds in G, then $\alpha \Rightarrow_r^* \beta$ holds in G.

Proof. This follows directly from the fact that $\Rightarrow_R$ is a subrelation of $\Rightarrow_r$. □

2 For the moment, we ignore the fact that a right-to-left scan of the input is not particularly useful
in practice.

Lemma 3.2 For $\alpha, \beta \in V^*$ and $A \in N$, if $\alpha \Rightarrow_r^* \beta A$ holds in G, then $\alpha \Rightarrow_R^* \beta A$ holds in G.

Proof. Let n represent the length of a rightmost derivation of $\beta A$ from $\alpha$. By induction on
n, we show that there exists an identical n-step strong rightmost derivation of $\beta A$ from $\alpha$.

Basis (n = 0). Assume that $\alpha \Rightarrow_r^0 \beta A$ holds in G. This implies that $\alpha = \beta A$, since $\Rightarrow_r^0$ is
equivalent to the identity relation on $V^*$. Since $\Rightarrow_R^0$ is also equivalent to the identity relation
on $V^*$, $\alpha \Rightarrow_R^0 \alpha = \beta A$ also holds in G.

Induction (n > 0). By assumption, $\alpha \Rightarrow_r^n \beta A$ holds in G. The last step in a particular n-step
derivation of $\beta A$ from $\alpha$ can take two distinct forms. These are analyzed in the following
two cases.

Case (i): $\alpha \Rightarrow_r^{n-1} \gamma B \Rightarrow_r \gamma\delta A = \beta A$ for some $\gamma \in V^*$ and $B \to \delta A \in P$. By the induction
hypothesis, $\alpha \Rightarrow_R^{n-1} \gamma B$ holds in G. Since $\gamma B \Rightarrow_R \gamma\delta A$ holds in G by definition, we conclude
that $\alpha \Rightarrow_R^n \beta A$ holds in G.

Case (ii): $\alpha \Rightarrow_r^{n-1} \beta A B \Rightarrow_r \beta A$ for some $B \to \varepsilon \in P$. By the induction hypothesis,
$\alpha \Rightarrow_R^{n-1} \beta A B$ holds in G. Thus, $\alpha \Rightarrow_R^n \beta A$ also holds in G since $\beta A B \Rightarrow_R \beta A$ holds.

In both cases, we have shown that $\alpha \Rightarrow_R^n \beta A$ holds in G. □

Lemma 3.3 For $\alpha \in V^*$ and $a \in T$, if $\alpha \Rightarrow_r^* \beta a$ holds in G for some $\beta \in V^*$, then
$\alpha \Rightarrow_R^* \gamma a$ holds in G for some $\gamma \in V^*$.

Proof. Assume that $\alpha \Rightarrow_r^* \beta a$ holds in G for some $\beta \in V^*$. If $\alpha = \gamma a$ for some $\gamma \in V^*$, then
$\alpha \Rightarrow_R^0 \gamma a = \alpha$ trivially holds in G. Otherwise, suppose that $\alpha$ does not end with a. In this
case, every rightmost derivation of $\beta a$ from $\alpha$ is nontrivial. We analyze one such rightmost
derivation and focus on the step that causes a to become the rightmost symbol in a string
occurring in that derivation. The initial segment of the derivation up to and including this
step can take two distinct forms.

Case (i): $\alpha \Rightarrow_r^* \delta A \Rightarrow_r \delta\omega a$ for some $\delta \in V^*$ and $A \to \omega a \in P$. By Lemma 3.2, $\alpha \Rightarrow_R^* \delta A$ holds in
G. By definition, $\delta A \Rightarrow_R \delta\omega a$ holds in G. Thus, $\alpha \Rightarrow_R^* \gamma a$ holds in G when we let $\gamma = \delta\omega$.

Case (ii): $\alpha \Rightarrow_r^* \delta a A \Rightarrow_r \delta a$ for some $\delta \in V^*$ and $A \to \varepsilon \in P$. Similar to Case (i), $\alpha \Rightarrow_R^* \delta a A$ and
$\delta a A \Rightarrow_R \delta a$ both hold in G. Now we let $\gamma = \delta$ to conclude that $\alpha \Rightarrow_R^* \gamma a$ holds in G.

We have demonstrated in both cases that $\alpha \Rightarrow_R^* \gamma a$ holds in G for some $\gamma \in V^*$. □

Lemma 3.4 For $A \in N$ and $X \in V$, X is right-reachable from A in G if and only if
$A \Rightarrow_R^+ \alpha X$ holds in G for some $\alpha \in V^*$.

Proof. If X is right-reachable from A in G, then $A \Rightarrow_r^+ \beta X$ holds in G for some $\beta \in V^*$. If
$X \in N$, then $A \Rightarrow_R^+ \beta X$ also holds in G by Lemma 3.2. If $X \in T$, then Lemma 3.3 applies,
i.e., $A \Rightarrow_R^+ \alpha X$ holds in G for some $\alpha \in V^*$. Conversely, suppose that $A \Rightarrow_R^+ \alpha X$ holds in G
for some $\alpha \in V^*$. It follows directly from Lemma 3.1 that X is right-reachable from A. □

Corollary For $A \in N$, A is right-recursive in G if and only if $A \Rightarrow_R^+ \alpha A$ holds in G for
some $\alpha \in V^*$. □

Lemma 3.5 For $X \in V$, X is nullable in G if and only if $X \Rightarrow_R^* \varepsilon$ holds in G.

Proof. If $X \in T$, X is not nullable in G and $X \Rightarrow_R^* \varepsilon$ does not hold in G. Now suppose that
$X \in N$. If X is nullable in G, then every rightmost derivation which demonstrates this must
be of the form $X \Rightarrow_r^* A \Rightarrow_r \varepsilon$ for some $A \to \varepsilon \in P$. From Lemma 3.2 and the fact that $A \Rightarrow_R \varepsilon$
holds in G, we conclude that $X \Rightarrow_R^* \varepsilon$ holds in G. Conversely, $X \Rightarrow_R^* \varepsilon$ immediately implies
that X is nullable in G since $\Rightarrow_R$ is a subrelation of $\Rightarrow_r$. □

Corollary For $\gamma \in V^*$, $\gamma$ is nullable in G if and only if $\gamma \Rightarrow_R^* \varepsilon$ holds in G. □

One final lemma is presented before introducing the companion relation to $\Rightarrow_R$. The
lemma is useful for motivating this second relation.

Lemma 3.6 For $\alpha \in V^*$, at least one of the following two statements is true: (1)
$\alpha \Rightarrow_R^* \beta a$ holds in G for some $\beta \in V^*$ and $a \in T$; (2) $\alpha \Rightarrow_R^* \varepsilon$ holds in G.

Proof. If $\alpha = \varepsilon$, then statement (2) holds trivially. Now suppose that $\alpha \neq \varepsilon$. Since G is
reduced, $\alpha \Rightarrow_r^* x$ holds in G for some $x \in T^*$. If $x = \varepsilon$, then statement (2) again holds from
the corollary to Lemma 3.5. Otherwise, $x = ya$ for some $y \in T^*$ and $a \in T$. By Lemma 3.3,
it now follows that $\alpha \Rightarrow_R^* \beta a$ holds in G for some $\beta \in V^*$. □

Lemma 3.3, in contrast to Lemma 3.2, illustrates that a rightmost derivation departs
from a strong rightmost derivation following the step where a terminal symbol first appears
at the right end of a string occurring in the rightmost derivation. The role of the second
relation that we introduce is to dispense with terminal symbols as they appear at the right
end of strings in strong rightmost derivations. Specifically, the chop relation is defined by
$I = \{(\alpha a, \alpha) \mid \alpha \in V^*,\ a \in T\}$. For every $a \in T$, $I_a$ denotes the subrelation of $I$ with domain
$V^*a$. Thus, for $\alpha, \beta \in V^*$ and $a \in T$, $\alpha\ I_a\ \beta$ holds if and only if $\alpha\ I\ \beta$ and $\alpha = \beta a$ hold.

The relation product $\Rightarrow_R^* I$, a useful composition that is suggested by Lemma 3.6, is
used extensively in what follows. Formally, for $\alpha, \beta \in V^*$, $\alpha \Rightarrow_R^* I\ \beta$ holds in G if and only if
$\alpha \Rightarrow_R^* \beta a\ I\ \beta$ holds in G for some $a \in T$; this latter expression is usually written as $\alpha (\Rightarrow_R^* I)_a \beta$.

For clarity, we describe inductively the notation that we will employ for exploiting the
reflexive-transitive closure of $(\Rightarrow_R^* I)$; similar conventions are applied to other relation
products that are introduced later. For all $\alpha \in V^*$, $\alpha (\Rightarrow_R^* I)_\varepsilon^0 \alpha$ holds in G; for $\alpha, \beta, \gamma \in V^*$,
$y \in T^{n-1}$ with $n \geq 1$, and $a \in T$, if $\alpha (\Rightarrow_R^* I)_y^{n-1} \beta$ and $\beta (\Rightarrow_R^* I)_a \gamma$ hold in G, then
$\alpha (\Rightarrow_R^* I)_{ay}^n \gamma$ holds in G. The order of ay in the latter expression reflects the fact that the
terminal symbols of a string are generated by $\Rightarrow_R$ and chopped by $I$ from right to left. Finally, if
$\alpha (\Rightarrow_R^* I)_y^n \beta$ holds in G for some $\alpha, \beta \in V^*$ and $y \in T^n$ with $n \geq 0$, then for convenience we
may instead write this expression as $\alpha (\Rightarrow_R^* I)_y \beta$, $\alpha (\Rightarrow_R^* I)^* \beta$, or $\alpha (\Rightarrow_R^* I)^n \beta$ according to
whether or not the string y or its length n is relevant.
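
The product of the R-derives closure with the chop relation can be sketched on finite sets of strings. This is an illustration under stated assumptions: the grammar and the names `R_closure`, `chop` and `step` are hypothetical, and the length bound `max_len` is an artificial device, because the closure of a set under R-derives is infinite in general. With empty productions a derivation may pass through strings longer than the bound, so the bounded result can be a proper subset of the true image.

```python
# Hypothetical grammar: S -> a S b | a b.
GRAMMAR = {"S": [("a", "S", "b"), ("a", "b")]}

def R_closure(grammar, seeds, max_len):
    """Strings reachable from `seeds` by =>_R^*, restricted to length <= max_len."""
    seen = {s for s in seeds if len(s) <= max_len}
    work = list(seen)
    while work:
        form = work.pop()
        if form and form[-1] in grammar:  # last symbol is a nonterminal
            for rhs in grammar[form[-1]]:
                nxt = form[:-1] + rhs
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
    return seen

def chop(forms, a):
    """Image of `forms` under I_a: drop a trailing terminal `a`."""
    return {f[:-1] for f in forms if f and f[-1] == a}

def step(grammar, forms, a, max_len):
    """Bounded image of `forms` under the product (=>_R^* I)_a."""
    return chop(R_closure(grammar, forms, max_len), a)
```

Applying `step` once per terminal of aabb, taken from right to left and starting from the set containing only S, ends with the set containing only the empty string.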

Right  Sentential  Forms  Revisited 

Next we investigate how arbitrary rightmost derivations are mimicked by the $\Rightarrow_R$ and
$I$ relations. In short, a rightmost derivation is represented as a sequence of strong rightmost
derivations interspersed with chops of terminal symbols. As a result of this analysis, the precise
manner in which right sentential forms and sentences are generated by the two new relations
is revealed.




Lemma 3.7 For $\alpha, \beta \in V^*$, if $\alpha \Rightarrow_R^* \beta$ holds in G, then $\alpha z \Rightarrow_r^* \beta z$ holds in G for every
$z \in T^*$.

Proof. If $\alpha \Rightarrow_R^* \beta$ holds in G, then $\alpha \Rightarrow_r^* \beta$ holds in G by Lemma 3.1. The consequent in its
full generality can then be established by an induction on the length of an arbitrary string
$z \in T^*$. □

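Lemma 3.8 For $\alpha, \beta \in V^*$ and $z \in T^*$, if $\alpha (\Rightarrow_R^* I)_z \beta$ holds in G, then $\alpha \Rightarrow_r^* \beta z$ holds in
G.

Proof. The proof is by induction on n = len(z).

Basis (n = 0). In this case, $z = \varepsilon$. By assumption, $\alpha (\Rightarrow_R^* I)_\varepsilon^0 \beta$ holds in G. It must then be the
case that $\alpha = \beta$, so $\alpha \Rightarrow_r^* \beta$ trivially holds in G.

Induction (n > 0). In this case, $z = ay$ for some $a \in T$ and $y \in T^{n-1}$. Assume that
$\alpha (\Rightarrow_R^* I)_{ay}^n \beta$ holds in G. Then $\alpha (\Rightarrow_R^* I)_y^{n-1} \gamma (\Rightarrow_R^* I)_a \beta$ holds in G for some $\gamma \in V^*$. By the
induction hypothesis, $\alpha \Rightarrow_r^* \gamma y$ holds in G. Furthermore, $\gamma (\Rightarrow_R^* I)_a \beta$ implies that $\gamma \Rightarrow_R^* \beta a\ I\ \beta$
holds in G. By Lemma 3.7, $\gamma y \Rightarrow_r^* \beta a y$ holds in G, so $\alpha \Rightarrow_r^* \beta a y = \beta z$ also holds in G. □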

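Lemma 3.9 For $\alpha, \beta \in V^*$ and $z \in T^*$, if $\alpha (\Rightarrow_R^* I)_z \Rightarrow_R^* \beta$ holds in G, then $\alpha \Rightarrow_r^* \beta z$ holds
in G.

Proof. By assumption, $\alpha (\Rightarrow_R^* I)_z \Rightarrow_R^* \beta$ holds in G. This implies that $\alpha (\Rightarrow_R^* I)_z \gamma \Rightarrow_R^* \beta$ holds in
G for some $\gamma \in V^*$. By Lemma 3.8, $\alpha \Rightarrow_r^* \gamma z$ holds in G. Since $\gamma \Rightarrow_R^* \beta$ holds in G, $\gamma z \Rightarrow_r^* \beta z$
holds in G by Lemma 3.7. Therefore, $\alpha \Rightarrow_r^* \beta z$ holds in G. □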

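Lemma 3.10 For $\alpha, \beta, \gamma \in V^*$ and $x, y \in T^*$, if $\alpha (\Rightarrow_R^* I)_y \Rightarrow_R^* \beta$ and $\beta (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$ hold in
G, then $\alpha (\Rightarrow_R^* I)_{xy} \Rightarrow_R^* \gamma$ holds in G.

Proof. The key observation relevant here is that the expression $\alpha (\Rightarrow_R^* I)_y \Rightarrow_R^* \beta (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$
may be rewritten as $\alpha (\Rightarrow_R^* I)_y (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$; to make this transformation, the occurrence of
$\Rightarrow_R^*$ preceding $\beta$ in the first expression is "absorbed" by $(\Rightarrow_R^* I)_x$ if $x \neq \varepsilon$ and by the
occurrence of $\Rightarrow_R^*$ preceding $\gamma$ otherwise. It is now immediate that $\alpha (\Rightarrow_R^* I)_{xy} \Rightarrow_R^* \gamma$ holds in
G. □

Lemma 3.11 For $\alpha \in V^*$ and $z \in T^*$, $\alpha z (\Rightarrow_R^* I)_z \Rightarrow_R^* \alpha$ holds in G.

Proof. This is shown by an easy induction on n = len(z).

Basis (n = 0). Trivially, $\alpha (\Rightarrow_R^* I)_\varepsilon^0 \Rightarrow_R^* \alpha$ holds in G.

Induction (n > 0). Let $z = ay$ for some $a \in T$ and $y \in T^{n-1}$. By the induction hypothesis,
$\alpha a y (\Rightarrow_R^* I)_y^{n-1} \Rightarrow_R^* \alpha a$ holds in G. Observing that $\alpha a \Rightarrow_R^* \alpha a\ I\ \alpha \Rightarrow_R^* \alpha$ holds in G establishes
that $\alpha a (\Rightarrow_R^* I)_a \Rightarrow_R^* \alpha$ also holds. It now follows from Lemma 3.10 that $\alpha a y (\Rightarrow_R^* I)_{ay}^n \Rightarrow_R^* \alpha$
holds in G. □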

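Lemma 3.12 For $\alpha, \beta \in V^*$, let $\alpha \Rightarrow_r^* \beta$ hold in G. Furthermore, let $\beta = \gamma x$ for some
$\gamma \in V^*$ and $x \in T^*$ where $\gamma \in V^*N$ if $\beta \in V^*NT^*$ and $\gamma = \varepsilon$ otherwise (i.e., x is the longest suffix
of $\beta$ consisting solely of terminal symbols). Then $\alpha (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$ holds in G.

Proof. The proof is by induction on the length n of a rightmost derivation of $\beta$ from $\alpha$.

Basis (n = 0). Thus, $\alpha \Rightarrow_r^0 \beta = \alpha$. Write $\alpha$ as $\gamma x$ for some $\gamma \in V^*$ and $x \in T^*$ where x is the
longest suffix of $\alpha$ contained in $T^*$. In this case, $\alpha = \gamma x (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$ holds in G by Lemma
3.11.

Induction (n > 0). A rightmost derivation of $\beta$ from $\alpha$ consisting of n steps is of the form
$\alpha \Rightarrow_r^{n-1} \delta A z \Rightarrow_r \delta\omega z = \beta$ for some $\delta \in V^*$, $A \to \omega \in P$, and $z \in T^*$. By the induction hypothesis,
$\alpha (\Rightarrow_R^* I)_z \Rightarrow_R^* \delta A$ holds in G. Since $\delta A \Rightarrow_R \delta\omega$ holds in G, $\alpha (\Rightarrow_R^* I)_z \Rightarrow_R^* \delta\omega$ also holds. Now
write $\delta\omega$ as $\gamma y$ for some $\gamma \in V^*$ and $y \in T^*$ where y is the longest suffix of $\delta\omega$ made up
entirely of terminal symbols. By Lemma 3.11, $\delta\omega = \gamma y (\Rightarrow_R^* I)_y \Rightarrow_R^* \gamma$ holds in G. It then
follows from Lemma 3.10 that $\alpha (\Rightarrow_R^* I)_{yz} \Rightarrow_R^* \gamma$ holds in G. Finally, we note that $\beta = \gamma y z$ where,
by construction, yz is the longest suffix of $\beta$ that is comprised of only terminal symbols. □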
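
Theorem 3.13 $SF_r(G) = \{\gamma \in V^* \mid S (\Rightarrow_R^* I)_z \Rightarrow_R^* \alpha$ holds in G for some $\alpha \in V^*$ and
$z \in T^*$ such that $\gamma = \alpha z\}$.

Proof. Suppose that $S (\Rightarrow_R^* I)_z \Rightarrow_R^* \alpha$ holds in G for some $\alpha \in V^*$ and $z \in T^*$. By Lemma 3.9,
$S \Rightarrow_r^* \alpha z$ also holds in G, so $\alpha z \in SF_r(G)$. Conversely, suppose that $S \Rightarrow_r^* \gamma$ holds in G for
some $\gamma \in V^*$. Let $\gamma = \alpha z$ for some $\alpha \in V^*$ and $z \in T^*$ such that z is the longest suffix of $\gamma$
which is a terminal string. Then $S (\Rightarrow_R^* I)_z \Rightarrow_R^* \alpha$ holds in G by Lemma 3.12. □

Corollary $L(G) = \{w \in T^* \mid S (\Rightarrow_R^* I)_w \Rightarrow_R^* \varepsilon$ holds in G$\}$. □

Corollary $\mathrm{SUFFIX}(G) = \{z \in T^* \mid S (\Rightarrow_R^* I)_z \alpha$ holds in G for some $\alpha \in V^*\}$. □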
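
As a worked illustration (the grammar is a hypothetical example, not one taken from the text), let the productions be $S \to aSb$ and $S \to ab$. The sentence aabb has the rightmost derivation $S \Rightarrow_r aSb \Rightarrow_r aabb$. The chain below interleaves R-derives steps with chops, removing each terminal as soon as it reaches the right end, and so witnesses membership of aabb in L(G) in the sense of the first corollary:

```latex
S \;\Rightarrow_R\; aSb \;\; I_b \;\; aS \;\Rightarrow_R\; aab \;\; I_b \;\; aa \;\; I_a \;\; a \;\; I_a \;\; \varepsilon
\qquad\text{hence}\qquad
S \,(\Rightarrow_R^{*} I)_{aabb}\, \Rightarrow_R^{*}\, \varepsilon .
```

The chopped terminals, in order, are b, b, a, a, which read from right to left spell aabb. Stopping the chain after the first two chops shows that bb is a member of SUFFIX(G), with $\alpha = aa$.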

Viable  Prefixes 

A concept  that  plays  a central  role  in  LR  parsing  theory  is  that  of  a viable  prefix. 
Viable  prefixes  are  also  prominent  in  our  treatment  of  general  recognition  and  parsing. 
Viable prefixes are defined in terms of rightmost derivations and right sentential forms as
follows. A string $\gamma \in V^*$ is a viable prefix³ of G if $S \Rightarrow_r^* \delta A z \Rightarrow_r \delta\alpha\beta z = \gamma\beta z$ holds in G for
some $\delta \in V^*$, $A \to \alpha\beta \in P$, and $z \in T^*$. Thus, viable prefixes are certain prefixes of right
sentential forms. The set of viable prefixes of G is denoted by VP(G).

In the next series of lemmas, a definition of the viable prefixes of G in terms of the
R-derives and chop relations is developed. It transpires that this definition is remarkably
similar to the definition of $SF_r(G)$ just given. Since viable prefixes are defined via nontrivial
rightmost derivations from S, our definition is carefully tailored to include S in VP(G) only
in case $S \Rightarrow_r^+ S\alpha$ holds in G for some $\alpha \in V^*$.

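Lemma 3.14 For $\alpha, \beta \in V^*$, if $\alpha \Rightarrow_R \beta$ holds in G and $\alpha$ is a viable prefix of G, then $\beta$
is a viable prefix of G.

Proof. Since $\alpha \Rightarrow_R \beta$ holds in G by assumption, $\alpha = \gamma A$ and $\beta = \gamma\omega$ for some $\gamma \in V^*$ and
$A \to \omega \in P$. Also by assumption, $\alpha \in VP(G)$, so $S \Rightarrow_r^* \delta B z \Rightarrow_r \delta\sigma\tau z = \alpha\tau z$ holds in G for some
$\delta \in V^*$, $B \to \sigma\tau \in P$, and $z \in T^*$. Since G is reduced, $\tau \Rightarrow_r^* y$ holds in G for some $y \in T^*$.
Thus, $S \Rightarrow_r^* \alpha y z = \gamma A y z \Rightarrow_r \gamma\omega y z = \beta y z$ holds in G, which shows that $\beta$ is a viable prefix of G.
□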

Lemma 3.15 For $\alpha, \beta \in V^*$, if $\alpha \Rightarrow_R^* \beta$ holds in G and $\alpha$ is a viable prefix of G, then $\beta$
is a viable prefix of G.

Proof. Applying the preceding lemma, this lemma is established by an easy induction on the
length of a strong rightmost derivation of $\beta$ from $\alpha$. □

3 This definition is borrowed from Sippu and Soisalon-Soininen [38]. Although it differs slightly from
others (cf. [5]), it is more appropriate to our needs.

Lemma 3.16 For $\alpha, \beta \in V^*$, if $\alpha\ I\ \beta$ holds in G and $\alpha$ is a viable prefix of G, then $\beta$ is a
viable prefix of G.

Proof. From the hypothesis, $\alpha = \beta a$ for some $a \in T$. Conventional definitions of viable
prefixes [5] prescribe that every prefix of a viable prefix of G is also a viable prefix of G.
However, this property is not immediate from the definition that we have adopted. A proof
that this property does hold in our definition is provided by Sippu and Soisalon-Soininen [38].
The essence of their argument is based on the existence of a rightmost derivation of the form
$S \Rightarrow_r^* \delta A z \Rightarrow_r \delta\sigma a\tau z = \beta a\tau z$ for some $\delta \in V^*$, $A \to \sigma a\tau \in P$, and $z \in T^*$. This derivation form
demonstrates that both $\beta a = \alpha$ and $\beta$ are viable prefixes of G. □

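Lemma 3.17 For $\gamma \in V^*$, if $\omega (\Rightarrow_R^* I)_z \gamma$ holds in G for some $S \to \omega \in P$ and $z \in T^*$, then
$\gamma$ is a viable prefix of G.

Proof. The proof is by induction on n = len(z).

Basis (n = 0). In this case, $z = \varepsilon$. By assumption, $\omega (\Rightarrow_R^* I)_\varepsilon^0 \gamma$ holds in G for some $S \to \omega \in P$.
Then $\gamma$ must equal $\omega$, which is a viable prefix of G since $S \Rightarrow_r \omega$ holds in G.

Induction (n > 0). In this case, $z = ay$ for some $a \in T$ and $y \in T^{n-1}$. Assume that
$\omega (\Rightarrow_R^* I)_{ay}^n \gamma$ holds in G. Then $\omega (\Rightarrow_R^* I)_y^{n-1} \beta (\Rightarrow_R^* I)_a \gamma$ holds in G for some $\beta \in V^*$. By the
induction hypothesis, $\beta \in VP(G)$. Now $\beta (\Rightarrow_R^* I)_a \gamma$ implies that $\beta \Rightarrow_R^* \gamma a\ I\ \gamma$ holds in G. It
follows from Lemmas 3.15 and 3.16 that $\gamma a$ and $\gamma$ are also viable prefixes of G. □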

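Lemma 3.18 For $\gamma \in V^*$, if $\omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \gamma$ holds in G for some $S \to \omega \in P$ and $z \in T^*$,
then $\gamma$ is a viable prefix of G.

Proof. Assume that $\omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \gamma$ holds in G for some $S \to \omega \in P$ and $z \in T^*$. This implies
that $\omega (\Rightarrow_R^* I)_z \beta \Rightarrow_R^* \gamma$ holds in G for some $\beta \in V^*$. By Lemma 3.17, $\beta$ is a viable prefix of G.
Thus, $\gamma$ is also in VP(G) by Lemma 3.15. □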

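Lemma 3.19 For $\gamma \in V^*$, if $\gamma$ is a viable prefix of G, then $\omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \gamma$ holds in G for
some $S \to \omega \in P$ and $z \in T^*$.

Proof. By assumption, $\gamma \in VP(G)$. Thus, $S \Rightarrow_r^* \delta A y \Rightarrow_r \delta\alpha\beta y = \gamma\beta y$ holds in G for some
$\delta \in V^*$, $A \to \alpha\beta \in P$, and $y \in T^*$. From the proof of Lemma 3.12, $S (\Rightarrow_R^* I)_y \Rightarrow_R^* \gamma\beta$ holds in G.
Since G is reduced, $\beta \Rightarrow_r^* x$ holds in G for some $x \in T^*$. Therefore, $\gamma\beta \Rightarrow_r^* \gamma x$ and
$\gamma\beta (\Rightarrow_R^* I)_x \Rightarrow_R^* \gamma$ both hold in G. Combining these results in the manner of Lemma 3.10,
$S (\Rightarrow_R^* I)_{xy} \Rightarrow_R^* \gamma$ holds in G. Since the nontrivial rightmost derivation of $\gamma\beta y$ from S must
have a first step of the form $S \Rightarrow_r \omega$ for some $S \to \omega \in P$, $\omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \gamma$ holds in G where
$z = xy$. □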

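Theorem 3.20 $VP(G) = \{\gamma \in V^* \mid \omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \gamma$ holds in G for some $S \to \omega \in P$ and
$z \in T^*\}$.

Proof. This theorem follows directly from Lemmas 3.18 and 3.19. □

Corollary $VP(G) = \{\gamma \in V^* \mid S (\Rightarrow_R \cup\ I)^+ \gamma$ holds in G$\}$. □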
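
The last corollary suggests a direct, if naive, way to enumerate viable prefixes: search from S using single R-derives steps and single chops. The Python sketch below does this under a length bound. The grammar, the function name `viable_prefixes` and the bound are assumptions made for illustration. Every string it returns is a viable prefix, but viable prefixes that can only be reached through a string longer than `max_len` are missed.

```python
# Hypothetical grammar: S -> a S b | a b.
GRAMMAR = {"S": [("a", "S", "b"), ("a", "b")]}

def viable_prefixes(grammar, start, max_len):
    """Strings g with S (=>_R U I)^+ g, exploring only strings of length <= max_len."""
    terminals = {s for rhss in grammar.values() for rhs in rhss for s in rhs} - set(grammar)
    found, work = set(), [(start,)]  # the start string itself is not yet "found"
    while work:
        form = work.pop()
        succs = set()
        if form and form[-1] in grammar:    # one =>_R step
            succs |= {form[:-1] + rhs for rhs in grammar[form[-1]]}
        if form and form[-1] in terminals:  # one chop step
            succs.add(form[:-1])
        for nxt in succs:
            if len(nxt) <= max_len and nxt not in found:
                found.add(nxt)
                work.append(nxt)
    return found
```

For this grammar S is not right-recursive, so the one-symbol string S is never reached again and is correctly absent from the result, in line with the remark that S belongs to VP(G) only when S derives a string beginning with S in one or more steps.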

One final observation is that VP(G) is closed under $(\Rightarrow_R \cup\ I)$. Indeed, this is immediate
from Lemmas 3.14 and 3.16. Due to its importance in general canonical top-down recognition,
this property is formally recorded below.

Corollary For $\alpha, \beta \in V^*$, if $\alpha \in VP(G)$ and $\alpha (\Rightarrow_R \cup\ I)^* \beta$ holds in G, then $\beta \in VP(G)$.
□

General  Top-Down  Correct-Suffix  Recognition 

Let $w \in T^*$ be an arbitrary input string. A top-down scheme for recognizing w with
respect to G is described next. In this scheme, w is scanned from right to left. As a consequence,
an incrementally longer suffix of w is recognized in the process.

The general recognition scheme effectively pursues all of the possible rightmost derivations
of w in parallel. This is carried out through regularity-preserving operations on regular
subsets of VP(G). Adoption of this approach obviates the need for backtracking.

General context-free recognition is an inherently nondeterministic task. Hence, it is not
generally possible to pursue the rightmost derivations of w exclusively. Instead, at the point
where a suffix z of w has been processed, all rightmost derivations (from S) of all strings in
$T^*z \cap L(G)$ are followed (i.e., all sentences that have z as a suffix).

The essence of the recognition scheme, called General_RR, is simple. Let $z \in T^*$ be a
suffix of w and suppose that all proper suffixes of z are known members of SUFFIX(G). The
set of strings defined by $\{\alpha \in VP(G) \mid S \Rightarrow_r^* \alpha z$ holds in G$\}$ is used to determine if z is a
member of SUFFIX(G). This set is nonempty if and only if $z \in \mathrm{SUFFIX}(G)$. Moreover, it
contains $\varepsilon$ if and only if $z \in L(G)$. The General_RR recognition scheme is described in
greater detail in what follows. For reference, the recognizer is presented as Figure 3.1.


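function General_RR(G = (V, T, P, S); $w \in T^*$)
    // $w = a_1 a_2 \cdots a_n$, $n \geq 0$, each $a_i \in T$
    $\mathrm{PVP}_{RR}(G, \varepsilon)$ := $\{\omega \mid S \to \omega \in P\}$
    for i := 0 to n−1 do
        $\mathrm{VP}_{RR}(G, w{:}i)$ := $\Rightarrow_R^*(\mathrm{PVP}_{RR}(G, w{:}i))$
        $\mathrm{PVP}_{RR}(G, w{:}i{+}1)$ := $I_{a_{n-i}}(\mathrm{VP}_{RR}(G, w{:}i))$
        if $\mathrm{PVP}_{RR}(G, w{:}i{+}1) = \emptyset$ then Reject(w) fi
    od
    $\mathrm{VP}_{RR}(G, w)$ := $\Rightarrow_R^*(\mathrm{PVP}_{RR}(G, w))$
    if $\varepsilon \in \mathrm{VP}_{RR}(G, w)$ then Accept(w) else Reject(w) fi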


Figure  3.1  — A General  Top-Down  Correct-Suffix  Recognizer 
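
The recognizer of Figure 3.1 can be imitated with explicit finite sets of strings in place of regular sets. The Python sketch below is an approximation under loudly stated assumptions, not the automaton-based construction this work develops: it assumes a grammar without empty productions, and it reads w:i as the suffix of w of length i. Under the first assumption neither relation can shorten a string except by one symbol per chop, so any string longer than the number of input symbols still to be chopped can never lead to acceptance and is discarded. This keeps every set finite, at the cost that a pruned set may become empty (and the input be rejected) before the first suffix outside SUFFIX(G) is reached. The names `closure_R` and `general_rr` are hypothetical.

```python
def closure_R(grammar, seeds, max_len):
    """Image of `seeds` under =>_R^*, keeping only strings of length <= max_len."""
    seen = {s for s in seeds if len(s) <= max_len}
    work = list(seen)
    while work:
        form = work.pop()
        if form and form[-1] in grammar:
            for rhs in grammar[form[-1]]:
                nxt = form[:-1] + rhs
                if len(nxt) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    work.append(nxt)
    return seen

def general_rr(grammar, start, w):
    """Return True iff w is accepted (grammar assumed free of empty productions)."""
    n = len(w)
    pvp = set(grammar[start])                # PVP_RR(G, epsilon)
    for i in range(n):
        vp = closure_R(grammar, pvp, n - i)  # VP_RR(G, w:i), pruned
        a = w[n - 1 - i]                     # next symbol, scanning right to left
        pvp = {f[:-1] for f in vp if f and f[-1] == a}  # chop by a
        if not pvp:
            return False                     # first if statement: reject
    return () in closure_R(grammar, pvp, 0)  # second if statement
```

For example, with productions S → a S b | a b the call `general_rr({"S": [("a", "S", "b"), ("a", "b")]}, "S", "aabb")` accepts, while the input abb is rejected when the second b is scanned.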


For an arbitrary string $z \in T^*$, two sets of viable prefixes are identified with z. The
first set consists of the primitive RR-associates of z (in G) and is defined by
$\mathrm{PVP}_{RR}(G, z) = \{\alpha \in V^* \mid \omega (\Rightarrow_R^* I)_z \alpha$ holds in G for some $S \to \omega \in P\}$. The second set is a superset
of the first; it consists of the RR-associates of z (in G) and is defined by
$\mathrm{VP}_{RR}(G, z) = \{\alpha \in V^* \mid \omega (\Rightarrow_R^* I)_z \Rightarrow_R^* \alpha$ holds in G for some $S \to \omega \in P\}$. By Theorems 3.13 and 3.20,
$\mathrm{VP}_{RR}(G, z) = \{\alpha \in VP(G) \mid S \Rightarrow_r^* \alpha z$ holds in G$\}$, which equates to the set described in the
preceding paragraph. Input string w is recognized by computing $\mathrm{PVP}_{RR}(G, w{:}i)$ and
$\mathrm{VP}_{RR}(G, w{:}i)$ in turn as i ranges from 0 to len(w).

In words, $\mathrm{VP}_{RR}(G, z)$ is the reflexive-transitive closure of $\mathrm{PVP}_{RR}(G, z)$ under the $\Rightarrow_R$
relation. This fact is made explicit by expressing $\mathrm{VP}_{RR}(G, z)$ as $\{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$ holds in G
for some $\alpha \in \mathrm{PVP}_{RR}(G, z)\}$. Thus, if $\mathrm{PVP}_{RR}(G, z)$ is known, $\mathrm{VP}_{RR}(G, z)$ is obtained from it
through appropriate application of the $\Rightarrow_R$ relation.

The incremental aspect of General_RR becomes apparent in the computation of a set
of primitive RR-associates. Specifically, given $\mathrm{VP}_{RR}(G, z)$ and $a \in T$, $\mathrm{PVP}_{RR}(G, az)$ is
obtained by an application of the $I_a$ relation since $\mathrm{PVP}_{RR}(G, az) = \{\beta \in V^* \mid \alpha\ I_a\ \beta$ holds in G
for some $\alpha \in \mathrm{VP}_{RR}(G, z)\}$. It is apparent that $\mathrm{PVP}_{RR}(G, z)$ and $\mathrm{VP}_{RR}(G, z)$ are both
nonempty if and only if $z \in \mathrm{SUFFIX}(G)$. The computation of the primitive RR-associates of
$\varepsilon$, a suffix of every $w \in T^*$, serves as the initialization step. Specifically,
$\mathrm{PVP}_{RR}(G, \varepsilon) = \{\omega \mid S \to \omega \in P\}$.

Lastly, the conditions for termination of General_RR are specified. First suppose that
$w \in L(G)$. In this case, $\mathrm{VP}_{RR}(G, w)$ is the last set of RR-associates computed; after it is in
place, w is accepted based on the fact that $\varepsilon \in \mathrm{VP}_{RR}(G, w)$ if and only if $w \in L(G)$.
Conversely, suppose that $w \notin L(G)$. If $w \notin \mathrm{SUFFIX}(G)$ also holds, then there is a unique string
$z \in T^*$ which is the shortest proper suffix of w such that $z \notin \mathrm{SUFFIX}(G)$ holds. In this case,
$\mathrm{PVP}_{RR}(G, z)$ is the first empty set computed by the recognizer. Otherwise, if $w \notin L(G)$ and
$w \in \mathrm{SUFFIX}(G)$ both hold, then $\varepsilon \notin \mathrm{VP}_{RR}(G, w)$ by definition. In either case, the input string
is rejected.

The correctness of the General_RR recognition scheme is formally established in the
following two lemmas. The supporting arguments are quite straightforward given the collective
results to this point.

Lemma 3.21 Let $w \in L(G)$ be arbitrary. If General_RR is applied to G and w, then
General_RR accepts w.

Proof. By definition, $\mathrm{PVP}_{RR}(G, w{:}i)$ and $\mathrm{VP}_{RR}(G, w{:}i)$ are nonempty for all i, $0 \leq i \leq$ len(w).
Thus, the for loop completes len(w) iterations. Since $w \in L(G)$ by assumption,
$\varepsilon \in \mathrm{VP}_{RR}(G, w)$. Therefore, w is accepted by General_RR in the second if statement. □

Lemma 3.22 Let $w \notin L(G)$ be arbitrary. If General_RR is applied to G and w, then
General_RR rejects w.

Proof. There are two cases to consider based on whether or not w is in SUFFIX(G).

Case (i): $w \in \mathrm{SUFFIX}(G)$. In this case, $\mathrm{PVP}_{RR}(G, w{:}i)$ and $\mathrm{VP}_{RR}(G, w{:}i)$ are nonempty for
all i, $0 \leq i \leq$ len(w), so the for loop completes len(w) iterations. Since $w \notin L(G)$ by assumption,
$\varepsilon \notin \mathrm{VP}_{RR}(G, w)$. Therefore, w is rejected by General_RR in the second if statement.

Case (ii): $w \notin \mathrm{SUFFIX}(G)$. Since $w \neq \varepsilon$, $w = xay$ for some $x, y \in T^*$ and $a \in T$ such that
$y \in \mathrm{SUFFIX}(G)$, but $ay \notin \mathrm{SUFFIX}(G)$. Let len(y) = m and note that $0 \leq m <$ len(w) must
hold. $\mathrm{PVP}_{RR}(G, y{:}i)$ and $\mathrm{VP}_{RR}(G, y{:}i)$ are nonempty for all i, $0 \leq i \leq m$, so the for loop
completes m iterations. During the (m+1)st iteration, $\mathrm{PVP}_{RR}(G, ay) = \emptyset$ is computed.
Therefore, w is rejected by General_RR in the first if statement. □

Regularity  Properties 

Certain  regularity  properties  that  are  inherent  to  all  context-free  grammars  are 
exploited by General_RR. Specifically, for an arbitrary string $z \in T^*$, $\mathrm{PVP}_{RR}(G, z)$ and
$\mathrm{VP}_{RR}(G, z)$ are regular languages. This fact is proven in this section. Toward that end some
known  theoretical  results,  including  one  which  is  rather  obscure,  are  cited  below.  Since 
proofs  of  these  results  are  not  replicated  here,  the  proofs  that  follow  are  quite  brief. 

A type of formal rewriting system known as a regular canonical system is defined by
$C = (\Sigma, \Pi)$ where $\Sigma$ is an alphabet and $\Pi$ is a finite set of (rewriting) rules [21, 30, 37]. Each rule
in $\Pi$ takes the form $\xi\alpha \to \xi\beta$ where $\alpha, \beta \in \Sigma^*$ and $\xi$ denotes an arbitrary string over $\Sigma$, i.e.,
a variable. The form of a rule indicates that the left-hand side may be rewritten to its
corresponding right-hand side only at the extreme right end of a string. Thus, much like
R-derives, the C-derives relation induced on $\Sigma^*$ by $\Pi$ is defined by
$\Rightarrow_C = \{(\gamma\alpha, \gamma\beta) \mid \gamma \in \Sigma^*,\ \xi\alpha \to \xi\beta \in \Pi\}$. Given two languages $L_1, L_2 \subseteq \Sigma^*$, define
$\tau(L_1, C, L_2)$ by $\{\delta \in \Sigma^* \mid \gamma_1 \Rightarrow_C^* \gamma_2\delta$ holds in C for some $\gamma_1 \in L_1$ and $\gamma_2 \in L_2\}$.

A key  result  from  the  literature  relevant  to  regular  canonical  systems  is  the  following. 

Fact 3.1 Let $C = (\Sigma, \Pi)$ be a regular canonical system and let $L_1$ and $L_2$ be regular
languages over $\Sigma$. Then $\tau(L_1, C, L_2)$ is a regular language over $\Sigma$.

Proof. This is a restatement of Theorem 3 from Greibach [21]. □

The proof that $\mathrm{PVP}_{RR}(G, z)$ and $\mathrm{VP}_{RR}(G, z)$ are regular languages is based indirectly on
proofs that $\Rightarrow_R^*$ and $I$ are regularity-preserving relations. First, a relationship is established
between context-free grammars and regular canonical systems. Specifically, for a grammar
G = (V, T, P, S), the regular canonical system induced by G is defined by $C = (V, \xi P)$
where $\xi P = \{\xi A \to \xi\omega \mid A \to \omega \in P\}$.

Lemma 3.23 Relation $\Rightarrow_R^*$ is regularity-preserving.

Proof. Let G = (V, T, P, S) be a grammar, $C = (V, \xi P)$ the regular canonical system
induced by G, and L an arbitrary regular language over V. By Fact 3.1,
$\tau(L, C, \{\varepsilon\}) = \{\delta \in V^* \mid \gamma \in L,\ \gamma \Rightarrow_C^* \delta$ holds in C$\}$ is regular. Since the $\Rightarrow_R$ and $\Rightarrow_C$ relations
are equivalent, $\Rightarrow_R^*(L) = \tau(L, C, \{\varepsilon\})$. Therefore, $\Rightarrow_R^*$ is regularity-preserving. □

Lemma 3.24 Relation $I$ is regularity-preserving.

Proof. Let G = (V, T, P, S) be a grammar and let $L \subseteq V^*$ be an arbitrary regular language.
The quotient of a language $L_1$ with respect to a language $L_2$ is defined by
$L_1 / L_2 = \{x \mid xy \in L_1$ for some $y \in L_2\}$. Since the quotient of a regular language with respect to an
arbitrary set is a regular language [24], for every $a \in T$, $I_a(L) = L / \{a\}$ is regular. Therefore, $I$ is
regularity-preserving. □
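
For the special case of a single-symbol quotient, the construction is easy to make concrete. The Python sketch below is an illustration under assumptions: the automaton encoding and the names `chop_nfa` and `accepts` are hypothetical, and the automaton is assumed to have no epsilon moves. Given an automaton for L, an automaton for L/{a} keeps the same states and transitions and accepts in exactly those states that have an a-transition into an old accepting state.

```python
def chop_nfa(transitions, finals, a):
    """Accepting states of an automaton for L/{a}, given one for L.

    `transitions` maps (state, symbol) to a set of states. Start state and
    transitions are reused unchanged; only the accepting states move.
    """
    return {q for (q, sym), targets in transitions.items()
            if sym == a and targets & finals}

def accepts(transitions, start, finals, word):
    """Standard subset simulation of an automaton without epsilon moves."""
    current = {start}
    for sym in word:
        current = set().union(*(transitions.get((q, sym), set()) for q in current))
    return bool(current & finals)

# Hypothetical automaton for the finite language {ab, aSb}.
T = {(0, "a"): {1}, (1, "b"): {3}, (1, "S"): {2}, (2, "b"): {3}}
F = {3}
F_chopped = chop_nfa(T, F, "b")  # accepting states for {a, aS}
```

Here the new accepting states are 1 and 2, so the chopped automaton accepts a and aS, the images of ab and aSb under the chop by b.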

Theorem 3.25 Let G = (V, T, P, S) be an arbitrary grammar and let $z \in T^*$ be an
arbitrary string. Then $\mathrm{PVP}_{RR}(G, z)$ and $\mathrm{VP}_{RR}(G, z)$ are regular languages.

Proof. By induction on len(z), this theorem follows from Lemmas 3.23 and 3.24 and the fact
that $\mathrm{PVP}_{RR}(G, \varepsilon) = \{\omega \mid S \to \omega \in P\}$ is regular. □

Top-Down  Left-to-Right  Recognition 

In this section, a general top-down recognition scheme that presumes a left-to-right
scan of the input string is formally developed. Toward that end, consider the two relations
on $V^*$ defined by $\{(A\beta, \omega\beta) \mid A \to \omega \in P,\ \beta \in V^*\}$ and $\{(a\beta, \beta) \mid a \in T,\ \beta \in V^*\}$. Informally,
these relations represent left-biased counterparts of $\Rightarrow_R$ and $I$, respectively. Along the lines
of General_RR, a general top-down correct-prefix recognizer can be based on these two relations.
Specifically, leftmost derivations, left sentential forms, etc., can be defined in terms of
these relations analogously to how rightmost derivations, right sentential forms, etc., are
expressed in terms of $\Rightarrow_R$ and $I$. However, an alternate approach is suggested by the following
result.

Fact 3.2 For $\alpha, \beta \in V^*$, (1) $\alpha \Rightarrow_{lm}^* \beta$ holds in $G$ if and only if
$\alpha^R \Rightarrow_{rm}^* \beta^R$ holds in $G^R$; (2) $\alpha \Rightarrow^* \beta$ holds in $G$ if and
only if $\alpha^R \Rightarrow^* \beta^R$ holds in $G^R$.

Proof. A slightly stronger statement is presented by Sippu and Soisalon-Soininen as Fact 3.1
[38]. □

For future reference, some useful equivalences that are implied by Fact 3.2 include the
following: (1) $L(G^R) = (L(G))^R$, (2) $\mathrm{PREFIX}(G^R) = (\mathrm{SUFFIX}(G))^R$, and
(3) $\mathrm{SUFFIX}(G^R) = (\mathrm{PREFIX}(G))^R$.
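
Equivalence (1) can be checked by brute force on a small grammar. The sketch below is only an illustration: it assumes an epsilon-free grammar (so that sentential forms longer than the bound can be pruned), represents productions as (left-hand side, right-hand side tuple) pairs, and uses the hypothetical names `reverse_grammar` and `sentences_up_to`.

```python
def reverse_grammar(productions):
    """Build P^R by reversing every right-hand side."""
    return [(lhs, tuple(reversed(rhs))) for lhs, rhs in productions]


def sentences_up_to(productions, start, nonterminals, max_len):
    """Enumerate all sentences of length <= max_len by leftmost expansion.

    Assumes an epsilon-free grammar: forms never shrink, so any form
    longer than max_len can be discarded safely.
    """
    seen = {(start,)}
    work = [(start,)]
    result = set()
    while work:
        form = work.pop()
        # Position of the leftmost nonterminal, if any.
        idx = next((i for i, s in enumerate(form) if s in nonterminals), None)
        if idx is None:
            result.add("".join(form))
            continue
        for lhs, rhs in productions:
            if lhs == form[idx]:
                new = form[:idx] + tuple(rhs) + form[idx + 1:]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    work.append(new)
    return result


# Hypothetical example grammar: S -> a S b | c
P = [("S", ("a", "S", "b")), ("S", ("c",))]
forward = sentences_up_to(P, "S", {"S"}, 5)
backward = sentences_up_to(reverse_grammar(P), "S", {"S"}, 5)
same = {s[::-1] for s in forward} == backward
```

For this grammar the two bounded languages are mirror images of each other, as equivalence (1) predicts.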

Fact 3.2 is exploited rather extensively in what follows. In particular, leftmost derivations
in $G$, and ultimately general top-down correct-prefix recognition, are described in
terms of strong rightmost derivations in $G^R$ and the chop relation. Consequently, a substantial
portion of the results derived in the previous section is useful here as well. This
economizes on our efforts considerably.

Strong  Rightmost  Derivations  in  Reversed  Grammars 

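The R-derives relation induced on $V^*$ by $P^R$ is defined by
$\Rightarrow_R = \{(\alpha A, \alpha\omega) \mid \alpha \in V^*,\ A \to \omega \in P^R\}$. The relationship
between strong rightmost derivations in $G^R$ and leftmost derivations in $G$ is the subject of
the next series of lemmas.4

Lemma 3.26 For $\alpha, \beta \in V^*$, if $\alpha \Rightarrow_R^* \beta$ holds in $G^R$, then
$\alpha^R \Rightarrow_{lm}^* \beta^R$ holds in $G$.

Proof. By assumption, $\alpha \Rightarrow_R^* \beta$ holds in $G^R$. This implies that
$\alpha \Rightarrow_{rm}^* \beta$ holds in $G^R$ by Lemma 3.1. It follows from Fact 3.2 that
$\alpha^R \Rightarrow_{lm}^* \beta^R$ holds in $G$. □

Lemma 3.27 For $\alpha, \beta \in V^*$ and $A \in N$, if $\alpha \Rightarrow_{lm}^* A\beta$ holds in $G$,
then $\alpha^R \Rightarrow_R^* \beta^R A$ holds in $G^R$.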


4 The chop relations relevant to $G$ and $G^R$ are identical.

Proof. Assume that $\alpha \Rightarrow_{lm}^* A\beta$ holds in $G$. By Fact 3.2,
$\alpha^R \Rightarrow_{rm}^* (A\beta)^R = \beta^R A$ holds in $G^R$. Thus,
$\alpha^R \Rightarrow_R^* \beta^R A$ also holds in $G^R$ by Lemma 3.2. □

Lemma 3.28 For $\alpha \in V^*$ and $a \in T$, if $\alpha \Rightarrow_{lm}^* a\beta$ holds in $G$ for some
$\beta \in V^*$, then $\alpha^R \Rightarrow_R^* \gamma a$ holds in $G^R$ for some $\gamma \in V^*$.

Proof. If $\alpha \Rightarrow_{lm}^* a\beta$ holds in $G$ for some $\beta \in V^*$, then
$\alpha^R \Rightarrow_{rm}^* (a\beta)^R = \beta^R a$ holds in $G^R$ by Fact 3.2. By Lemma 3.3, it follows
that $\alpha^R \Rightarrow_R^* \gamma a$ holds in $G^R$ for some $\gamma \in V^*$. □

Lemma 3.29 For $A \in N$ and $X \in V$, $X$ is left-reachable from $A$ in $G$ if and only if $X$
is right-reachable from $A$ in $G^R$.

Proof. Assume that $A \Rightarrow_{lm}^+ X\beta$ holds in $G$ for some $\beta \in V^*$. By Fact 3.2,
$A \Rightarrow_{rm}^+ (X\beta)^R = \beta^R X$ holds in $G^R$, so $A \Rightarrow_R^+ \alpha X$ holds in $G^R$
for some $\alpha \in V^*$. This latter conclusion follows from Lemma 3.2 if $X \in N$, and from
Lemma 3.3 otherwise. Conversely, suppose that $A \Rightarrow_R^+ \alpha X$ holds in $G^R$ for some
$\alpha \in V^*$. It follows from Lemma 3.1 that $A \Rightarrow_{rm}^+ \alpha X$ holds in $G^R$. By
Fact 3.2, $A \Rightarrow_{lm}^+ (\alpha X)^R = X\alpha^R$ holds in $G$. □

Corollary For $A \in N$, $A$ is left-recursive in $G$ if and only if $A$ is right-recursive in
$G^R$. □

Clearly, the nullability of vocabulary symbols is invariant with respect to grammar
reversal. Thus, the following statements are equivalent for $X \in V$: (1) $X$ is nullable in $G$;
(2) $X \Rightarrow_R^* \epsilon$ holds in $G$; (3) $X \Rightarrow_R^* \epsilon$ holds in $G^R$. This observation
is easily generalized to strings in $V^*$.

Although Lemma 3.6 obviously applies to $G^R$, it is restated below in terms of $G^R$
because of its importance in showing how the $\Rightarrow_R^*$ and $\dashv$ relations cooperate.

Lemma 3.30 For $\alpha \in V^*$, at least one of the following two statements is true: (1)
$\alpha \Rightarrow_R^* \beta a$ holds in $G^R$ for some $\beta \in V^*$ and $a \in T$; (2)
$\alpha \Rightarrow_R^* \epsilon$ holds in $G^R$. □

Left  Sentential  Forms  Revisited 

The left sentential forms and sentences of $G$ are defined in terms of the R-derives and
chop relations of $G^R$. Similar to rightmost derivations, a leftmost derivation in $G$ is
rendered as an alternation of strong rightmost derivations in $G^R$ and rightmost chops of
terminal symbols.

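Lemma 3.31 For $\alpha, \beta \in V^*$ and $x \in T^*$, if
$\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \beta$ holds in $G^R$, then
$\alpha^R \Rightarrow_{lm}^* (\beta x)^R = x^R \beta^R$ holds in $G$.

Proof. By assumption, $\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \beta$ holds in $G^R$. It follows
from Lemma 3.9 that $\alpha \Rightarrow_{rm}^* \beta x$ also holds in $G^R$. This implies that
$\alpha^R \Rightarrow_{lm}^* (\beta x)^R = x^R \beta^R$ holds in $G$ by Fact 3.2. □

Lemma 3.32 For $\alpha, \beta \in V^*$, let $\alpha \Rightarrow_{lm}^* \beta$ hold in $G$. Write $\beta$ as
$x\gamma$ for some $x \in T^*$ and $\gamma \in V^*$ such that $\gamma \in NV^*$ if $\beta \in T^*NV^*$ and
$\gamma = \epsilon$ otherwise (i.e., $x$ is the longest prefix of $\beta$ that is made up of only
terminal symbols). Then $\alpha^R (\Rightarrow_R^* \dashv)^*_{x^R} \Rightarrow_R^* \gamma^R$ holds in $G^R$.

Proof. Assume that the conditions in the hypothesis of the lemma hold. From the assumption
that $\alpha \Rightarrow_{lm}^* \beta$ holds in $G$ and Fact 3.2, $\alpha^R \Rightarrow_{rm}^* \beta^R$ holds in
$G^R$. Since $\beta = x\gamma$, $\beta^R = (x\gamma)^R = \gamma^R x^R$. Thus, $x^R$ is the longest suffix of
$\beta^R$ that is made up of terminal symbols alone. We conclude from Lemma 3.12 that
$\alpha^R (\Rightarrow_R^* \dashv)^*_{x^R} \Rightarrow_R^* \gamma^R$ holds in $G^R$. □

Theorem 3.33 $\mathrm{SF}_l(G) = \{\gamma \in V^* \mid S (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \alpha$ holds
in $G^R$ for some $\alpha \in V^*$ and $x \in T^*$ such that $\gamma = (\alpha x)^R\}$.

Proof. First suppose that $S (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \alpha$ holds in $G^R$ for some
$\alpha \in V^*$ and $x \in T^*$. By Lemma 3.31, this implies that
$S \Rightarrow_{lm}^* (\alpha x)^R = x^R \alpha^R$ holds in $G$, so $(\alpha x)^R$ is a left sentential
form of $G$. Conversely, assume that $S \Rightarrow_{lm}^* \gamma$ holds in $G$ for some $\gamma \in V^*$.
Let $\gamma = x^R \alpha^R = (\alpha x)^R$ for $x \in T^*$ and $\alpha \in V^*$ such that $x^R$ is the longest
prefix of $\gamma$ contained in $T^*$. This implies, by Lemma 3.32, that
$S (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \alpha$ holds in $G^R$. □

Corollary $L(G) = \{w \in T^* \mid S (\Rightarrow_R^* \dashv)^*_{w^R} \Rightarrow_R^* \epsilon$ holds in $G^R\}$. □

Corollary $\mathrm{PREFIX}(G) = \{x \in T^* \mid S (\Rightarrow_R^* \dashv)^*_{x^R}\, \alpha$ holds in $G^R$ for
some $\alpha \in V^*\}$. □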

Viable  Suffixes 

A top-down complement to the class of LR($k$) grammars is the class of LL($k$) grammars
[28,36]. A theory of LL($k$) parsing that is a dual to the theory of LR($k$) parsing is developed
by Sippu and Soisalon-Soininen [38]. In particular, the concept of a viable suffix is introduced
as the LL dual to the viable prefix and plays a commensurately central role in the theory.
Symmetrically to the definition of viable prefixes, viable suffixes are defined in terms of
leftmost derivations and left sentential forms. A string $\gamma \in V^*$ is a viable suffix of $G$ if
$S \Rightarrow_{lm}^* xA\delta \Rightarrow_{lm} x\alpha\beta\delta = x\alpha\gamma^R$ holds in $G$ for some
$x \in T^*$, $A \to \alpha\beta \in P$, and $\delta \in V^*$. Thus, viable suffixes are reversals of certain
suffixes of left sentential forms. The set of viable suffixes of $G$ is denoted by $\mathrm{VS}(G)$.

The next series of lemmas develops a definition of the viable suffixes of $G$ in terms of
the $\Rightarrow_R$ and $\dashv$ relations of $G^R$. In that regard, the following result is useful.

Fact 3.3 (1) A string $\gamma \in V^*$ is a viable prefix of $G$ if and only if $\gamma$ is a viable
suffix of $G^R$; (2) a string $\gamma \in V^*$ is a viable suffix of $G$ if and only if $\gamma$ is a viable
prefix of $G^R$.

Proof. This is presented by Sippu and Soisalon-Soininen as Fact 3.2 [38]. □

Lemma 3.34 For $\alpha, \beta \in V^*$, if $\alpha$ is a viable suffix of $G$ and
$\alpha \Rightarrow_R \beta$ holds in $G^R$, then $\beta$ is a viable suffix of $G$.

Proof. If $\alpha$ is a viable suffix of $G$, then $\alpha$ is a viable prefix of $G^R$. Since
$\alpha \Rightarrow_R \beta$ holds in $G^R$, $\beta$ is a viable prefix of $G^R$ as well. Therefore,
$\beta$ is a viable suffix of $G$. □

Lemma 3.35 For $\alpha, \beta \in V^*$, if $\alpha$ is a viable suffix of $G$ and
$\alpha \Rightarrow_R^* \beta$ holds in $G^R$, then $\beta$ is a viable suffix of $G$.

Proof. This is a consequence of Lemmas 3.15 and 3.34. □

Lemma 3.36 For $\alpha, \beta \in V^*$, if $\alpha$ is a viable suffix of $G$ and $\alpha \dashv \beta$
holds in $G^R$, then $\beta$ is a viable suffix of $G$.

Proof. Using Fact 3.3, the proof of this lemma parallels that of Lemma 3.16. □

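Lemma 3.37 For $\gamma \in V^*$, if $\omega (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds in $G^R$
for some $S \to \omega \in P^R$ and $x \in T^*$, then $\gamma$ is a viable suffix of $G$.

Proof. Assume that $\omega (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds in $G^R$ for some
$S \to \omega \in P^R$ and $x \in T^*$. By Lemma 3.18, this implies that $\gamma$ is a viable prefix of
$G^R$. Thus, $\gamma \in \mathrm{VS}(G)$ by Fact 3.3. □

Lemma 3.38 For $\gamma \in V^*$, if $\gamma$ is a viable suffix of $G$, then
$\omega (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds in $G^R$ for some $S \to \omega \in P^R$ and
$x \in T^*$.

Proof. Assume that $\gamma$ is a viable suffix of $G$. By Fact 3.3, $\gamma$ is also a viable prefix of
$G^R$. Thus, $\omega (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds in $G^R$ for some
$S \to \omega \in P^R$ and $x \in T^*$ by Lemma 3.19. □

Theorem 3.39 $\mathrm{VS}(G) = \{\gamma \in V^* \mid \omega (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds
in $G^R$ for some $S \to \omega \in P^R$ and $x \in T^*\}$.

Proof. This theorem combines Lemmas 3.37 and 3.38. □

Corollary $\mathrm{VS}(G) = \{\gamma \in V^* \mid S (\Rightarrow_R \cup \dashv)^+ \gamma$ holds in $G^R\}$. □

Corollary For $\alpha, \beta \in V^*$, if $\alpha \in \mathrm{VS}(G)$ and
$\alpha (\Rightarrow_R \cup \dashv)^* \beta$ holds in $G^R$, then $\beta \in \mathrm{VS}(G)$. □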

General  Top-Down  Correct-Prefix  Recognition 

Let $w \in T^*$ be an arbitrary input string. A top-down scheme for recognizing $w$ with
respect to $G$ that is a left-to-right analog of General_RR is described next. This scheme,
called General_LL, scans $w$ from left to right as it recognizes an incrementally longer prefix
of the input string. General_LL effectively pursues all of the leftmost derivations of $w$ in
parallel through regularity-preserving operations on regular subsets of $\mathrm{VS}(G)$.

Again, the inherent nondeterminism of general context-free recognition subverts any
attempt to follow exclusively the leftmost derivations of $w$. Instead, at the point where a
prefix $x$ of $w$ has been processed, all leftmost derivations (from $S$) of all strings in
$xT^* \cap L(G)$ are followed (i.e., all sentences that have $x$ as a prefix).

The essence of General_LL mirrors that of General_RR. Let $x \in T^*$ be a prefix of $w$.
Suppose that all proper prefixes of $x$ are members of $\mathrm{PREFIX}(G)$. The set of strings defined
by $\{\beta \in \mathrm{VS}(G) \mid S \Rightarrow_{lm}^* x\beta^R$ holds in $G\}$ determines if
$x \in \mathrm{PREFIX}(G)$ holds. This set is nonempty if and only if $x \in \mathrm{PREFIX}(G)$ and it
contains $\epsilon$ if and only if $x \in L(G)$. General_LL, shown in Figure 3.2, is described in
greater detail in what follows.

For arbitrary $x \in T^*$, two sets of viable suffixes are identified with $x$. The first set, the
primitive LL-associates of $x$ (in $G$), is defined by
$\mathrm{PVS}_{LL}(G, x) = \{\beta \in V^* \mid \omega (\Rightarrow_R^* \dashv)^*_{x^R}\, \beta$ holds in $G^R$ for
some $S \to \omega \in P^R\}$. The other set contains the LL-associates of $x$ (in $G$) and is
defined by $\mathrm{VS}_{LL}(G, x) = \{\beta \in V^* \mid \omega (\Rightarrow_R^* \dashv)^*_{x^R} \Rightarrow_R^* \beta$
holds in $G^R$ for some $S \to \omega \in P^R\}$. By Theorems 3.33 and 3.39,
$\mathrm{VS}_{LL}(G, x) = \{\beta \in \mathrm{VS}(G) \mid S \Rightarrow_{lm}^* x\beta^R$ holds in $G\}$, which is
precisely the set described in the previous paragraph. Input string $w$ is recognized by computing
$\mathrm{PVS}_{LL}(G, i{:}w)$ and $\mathrm{VS}_{LL}(G, i{:}w)$ as $i$ ranges from 0 to len($w$).

function General_LL($G^R = (V, T, P^R, S)$; $w \in T^*$)
    // $w = a_1 a_2 \cdots a_n$, $n \ge 0$, each $a_i \in T$
    $\mathrm{PVS}_{LL}(G, \epsilon)$ := $\{\omega \mid S \to \omega \in P^R\}$
    for $i$ := 0 to $n-1$ do
        $\mathrm{VS}_{LL}(G, i{:}w)$ := $\Rightarrow_R^*(\mathrm{PVS}_{LL}(G, i{:}w))$
        $\mathrm{PVS}_{LL}(G, i{+}1{:}w)$ := $\dashv_{a_{i+1}}(\mathrm{VS}_{LL}(G, i{:}w))$
        if $\mathrm{PVS}_{LL}(G, i{+}1{:}w) = \emptyset$ then Reject($w$) fi
    od
    $\mathrm{VS}_{LL}(G, w)$ := $\Rightarrow_R^*(\mathrm{PVS}_{LL}(G, w))$
    if $\epsilon \in \mathrm{VS}_{LL}(G, w)$ then Accept($w$) else Reject($w$) fi
end

Figure 3.2 - A General Top-Down Correct-Prefix Recognizer

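The set $\mathrm{VS}_{LL}(G, x)$ is equivalently expressed as $\{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$
holds in $G^R$ for some $\alpha \in \mathrm{PVS}_{LL}(G, x)\}$; this form explicitly reflects that
$\mathrm{VS}_{LL}(G, x)$ is the reflexive-transitive closure of $\mathrm{PVS}_{LL}(G, x)$ under the
$\Rightarrow_R$ relation. Thus, $\mathrm{VS}_{LL}(G, x)$ is computed by applying $\Rightarrow_R^*$ to
$\mathrm{PVS}_{LL}(G, x)$.

Given $\mathrm{VS}_{LL}(G, x)$ and $a \in T$, $\mathrm{PVS}_{LL}(G, xa)$ is determined from
$\mathrm{VS}_{LL}(G, x)$ through an application of the $\dashv_a$ relation since
$\mathrm{PVS}_{LL}(G, xa) = \{\beta \in V^* \mid \alpha \dashv_a \beta$ holds in $G^R$ for some
$\alpha \in \mathrm{VS}_{LL}(G, x)\}$. Clearly, $\mathrm{PVS}_{LL}(G, x)$ and $\mathrm{VS}_{LL}(G, x)$ are both
nonempty if and only if $x \in \mathrm{PREFIX}(G)$. The initialization step entails computing the
primitive LL-associates of $\epsilon$, i.e., $\mathrm{PVS}_{LL}(G, \epsilon) = \{\omega \mid S \to \omega \in P^R\}$.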
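
The alternation of closure and chop steps described above can be illustrated with explicit finite sets of strings. The sketch below is not the automata-based representation used in this work: it assumes an epsilon-free grammar and discards any string longer than the unread portion of the input, which keeps the sets finite (even for left-recursive grammars) at the cost of the correct-prefix property. The function name `general_ll` and the tuple encoding of strings are assumptions made for illustration.

```python
def general_ll(word, productions, start, nonterminals):
    """Set-based sketch of the top-down scheme, working in the reversed grammar.

    Assumes an epsilon-free grammar. Strings are tuples of symbols.
    """
    # P^R: every right-hand side reversed.
    rev = [(lhs, tuple(reversed(rhs))) for lhs, rhs in productions]
    n = len(word)

    def r_derives_closure(forms, bound):
        """Closure under R-derives, keeping only strings of length <= bound."""
        seen = {f for f in forms if len(f) <= bound}
        work = list(seen)
        while work:
            form = work.pop()
            # R-derives rewrites the LAST symbol, which must be a nonterminal.
            if form and form[-1] in nonterminals:
                for lhs, rhs in rev:
                    if lhs == form[-1]:
                        new = form[:-1] + rhs
                        if len(new) <= bound and new not in seen:
                            seen.add(new)
                            work.append(new)
        return seen

    # Primitive associates of the empty prefix: right-hand sides of S in P^R.
    pvs = {rhs for lhs, rhs in rev if lhs == start}
    for i, a in enumerate(word):
        vs = r_derives_closure(pvs, n - i)
        # Chop-by-a: remove a trailing occurrence of the next input symbol.
        pvs = {f[:-1] for f in vs if f and f[-1] == a}
        if not pvs:
            return False
    return () in r_derives_closure(pvs, 0)


# Hypothetical example grammars.
ANBN = [("S", ("a", "S", "b")), ("S", ("a", "b"))]
EXPR = [("E", ("E", "+", "T")), ("E", ("T",)), ("T", ("a",))]
```

With these definitions, `general_ll("aabb", ANBN, "S", {"S"})` accepts and `general_ll("aab", ANBN, "S", {"S"})` rejects; the length bound is what allows the left-recursive grammar `EXPR` to terminate.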

The conditions under which General_LL terminates are analogous to those of
General_RR. If $w \in L(G)$, then $\mathrm{VS}_{LL}(G, w)$ is the last set of LL-associates computed;
after it is known, $w$ is accepted since $\epsilon \in \mathrm{VS}_{LL}(G, w)$ if and only if $w \in L(G)$.
Conversely, suppose that $w \notin L(G)$. If $w \notin \mathrm{PREFIX}(G)$ also holds, then there is a
unique string $x \in T^*$ which is the shortest prefix of $w$ such that $x \notin \mathrm{PREFIX}(G)$
holds. In this case, $\mathrm{PVS}_{LL}(G, x)$ is the first empty set computed by the recognizer.
Otherwise, if $w \notin L(G)$ and $w \in \mathrm{PREFIX}(G)$ both hold, then $\mathrm{VS}_{LL}(G, w)$ is found
not to contain $\epsilon$. In either case, $w$ is rejected.

The correctness of the General_LL recognition scheme is formally established in the
following two lemmas.

Lemma 3.40 Let $w \in L(G)$ be arbitrary. If General_LL is applied to $G^R$ and $w$, then
General_LL accepts $w$.

Proof. Since every prefix of $w$ is in $\mathrm{PREFIX}(G)$, $\mathrm{PVS}_{LL}(G, i{:}w)$ and
$\mathrm{VS}_{LL}(G, i{:}w)$ are nonempty for all $i$, $0 \le i \le$ len($w$). Thus, the for loop of
General_LL completes len($w$) iterations. By assumption, $w \in L(G)$, so
$\epsilon \in \mathrm{VS}_{LL}(G, w)$. Consequently, $w$ is accepted by General_LL in the second if
statement. □

Lemma 3.41 Let $w \notin L(G)$ be arbitrary. If General_LL is applied to $G^R$ and $w$, then
General_LL rejects $w$.

Proof. There are two cases to consider depending on whether or not $w \in \mathrm{PREFIX}(G)$.

Case (i): $w \in \mathrm{PREFIX}(G)$. In this case, $\mathrm{PVS}_{LL}(G, i{:}w)$ and
$\mathrm{VS}_{LL}(G, i{:}w)$ are nonempty for all $i$, $0 \le i \le$ len($w$), so the for loop of
General_LL completes len($w$) iterations. Since $w \notin L(G)$ by assumption,
$\epsilon \notin \mathrm{VS}_{LL}(G, w)$. Therefore, $w$ is rejected by General_LL in the if statement
that follows the for loop.

Case (ii): $w \notin \mathrm{PREFIX}(G)$. Let $x \in T^*$ be the unique string which is the longest prefix
of $w$ such that $x \in \mathrm{PREFIX}(G)$ holds. Let len($x$) $= m$ and note that
$0 \le m <$ len($w$). Since $\mathrm{PVS}_{LL}(G, i{:}x)$ and $\mathrm{VS}_{LL}(G, i{:}x)$ are nonempty for
all $i$, $0 \le i \le m$, the for loop of General_LL completes $m$ iterations. During the
$(m+1)$st iteration, $\mathrm{PVS}_{LL}(G, (m{+}1){:}w) = \emptyset$ is computed. Therefore, $w$ is rejected
by General_LL in the first if statement. □

Regularity  Properties 

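The regularity properties inherent to all context-free grammars that are exploited by
General_LL are identified in the following.

Theorem 3.42 Let $G = (V, T, P, S)$ be an arbitrary grammar and $x$ an arbitrary
string over $T$. Then $\mathrm{PVS}_{LL}(G, x)$ and $\mathrm{VS}_{LL}(G, x)$ are regular languages.

Proof. The proof is by induction on len($x$) $= n$. In particular, we show that
$\mathrm{PVS}_{LL}(G, x) = \mathrm{PVP}_{RR}(G^R, x^R)$ and $\mathrm{VS}_{LL}(G, x) = \mathrm{VP}_{RR}(G^R, x^R)$.
The proof is mostly an exercise in recalling definitions and putting them in the appropriate form.

Basis ($n = 0$). The following two equalities are obvious: (1)
$\mathrm{PVS}_{LL}(G, \epsilon) = \{\omega \in V^* \mid S \to \omega \in P^R\} = \mathrm{PVP}_{RR}(G^R, \epsilon)$;
(2) $\mathrm{VS}_{LL}(G, \epsilon) = \{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$ holds in $G^R$ for some
$\alpha \in \mathrm{PVS}_{LL}(G, \epsilon)\} = \{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$ holds in $G^R$ for some
$\alpha \in \mathrm{PVP}_{RR}(G^R, \epsilon)\} = \mathrm{VP}_{RR}(G^R, \epsilon)$.

Induction ($n > 0$). Let $x = ya$ for some $y \in T^{n-1}$ and $a \in T$. By the induction hypothesis,
$\mathrm{PVS}_{LL}(G, y) = \mathrm{PVP}_{RR}(G^R, y^R)$ and $\mathrm{VS}_{LL}(G, y) = \mathrm{VP}_{RR}(G^R, y^R)$.
Hence, $\mathrm{PVS}_{LL}(G, ya) = \{\beta \in V^* \mid \alpha \dashv_a \beta$ for some
$\alpha \in \mathrm{VS}_{LL}(G, y)\} = \{\beta \in V^* \mid \alpha \dashv_a \beta$ for some
$\alpha \in \mathrm{VP}_{RR}(G^R, y^R)\} = \mathrm{PVP}_{RR}(G^R, a y^R) = \mathrm{PVP}_{RR}(G^R, (ya)^R)$.
Finally, $\mathrm{VS}_{LL}(G, ya) = \{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$ holds in $G^R$ for some
$\alpha \in \mathrm{PVS}_{LL}(G, ya)\} = \{\beta \in V^* \mid \alpha \Rightarrow_R^* \beta$ holds in $G^R$ for some
$\alpha \in \mathrm{PVP}_{RR}(G^R, (ya)^R)\} = \mathrm{VP}_{RR}(G^R, (ya)^R)$. From Theorem 3.25, we conclude
that $\mathrm{PVS}_{LL}(G, x)$ and $\mathrm{VS}_{LL}(G, x)$ are regular languages. □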


Discussion 

A simple framework for describing general canonical top-down recognition was
presented. The set-theoretic framework is based on two relations on strings, $\Rightarrow_R$ and
$\dashv$. A key property of both of these relations is that they preserve regularity. The essence
of general top-down recognition was captured in terms of computing the images of regular sets
under these relations.

The definitions of the various objects of importance in the framework, namely sentences,
suffixes and prefixes of sentences, right and left sentential forms, etc., were cast in
terms of the $\Rightarrow_R$ and $\dashv$ relations. Consequently, it is a small step from these
definitions to the recognition schemes that are based on them. In addition, the correctness of
the recognizers is particularly easy to establish.

Given the impracticality of scanning input strings from right to left, it is worth
reflecting on why strong rightmost derivations were chosen over strong leftmost derivations
as a point of origin. If General_LL had been developed first, the evolution from General_LL
to General_RR certainly would have been no more involved than the progression in the other
direction. However, strong rightmost derivations were favored from the outset because
viable prefixes are considerably more ingrained in the literature than are viable suffixes.5 In
addition, the bottom-up left-to-right counterpart to General_RR that is developed in the
next chapter is derived directly from General_RR. Considerable attention is devoted to this
derivative of the General_RR recognition scheme in the rest of this work.


5 To date, we have yet to find a reference to Sippu and Soisalon-Soininen [38] in the literature.


CHAPTER  IV 

GENERAL  BOTTOM-UP  RECOGNITION:  A FORMAL  FRAMEWORK 

A formal  framework  for  describing  general  bottom-up  recognition  is  developed  next.  In 
particular,  a general  bottom-up  recognition  scheme  that  scans  input  strings  from  left  to  right 
is  presented.  The  bottom-up  left-to-right  character  of  the  recognition  scheme,  called 
General_LR, intimates that it is an inverse of General_RR. Indeed, General_LR is directly
derived from General_RR through inverses of the R-derives and chop relations. Consequently,
General_LR also exploits certain regularity properties of context-free grammars.

In  keeping  with  Chapter  III,  some  formal  aspects  of  general  bottom-up  recognition  are 
examined  in  a set-theoretic  framework.  Later  chapters  affect  a less  abstract  character; 
specifically,  General_LR  is  cast  into  concrete  terms,  viz.,  state-transition  graphs  and  finite- 
state  automata.  Ultimately,  a general  bottom-up  parser  based  on  General_LR  is  described. 
An  arbitrary  reduced  grammar  G =(V,T,P,S)  is  assumed  throughout  this  chapter. 

Bottom-Up  Left-to-Right  Recognition 

In a bottom-up approach to recognition, an attempt is made to construct a parse tree
for an input string, perhaps implicitly, by starting from the leaves and working toward the
root. A basic step in the upward synthesis of a parse tree involves grafting together the
roots of one or more subtrees into a larger subtree. Suppose that the collection of these
subtrees is represented by the string of grammar symbols which label their roots. A grafting
operation may be described in terms of applying the inverse of the $\Rightarrow$ relation to this
linearized form of the partially constructed parse tree. That is, the occurrence of a production
right-hand side in this string is replaced by (or reduced to) the corresponding left-hand side
nonterminal symbol; this symbol labels the root of the subtree produced by the grafting
operation. By performing reductions according to the inverse of the $\Rightarrow_{rm}$ relation
instead, a canonical left-to-right order is imposed on the parse tree construction process.

However, an alternative to the inverse of the $\Rightarrow_{rm}$ relation is provided by inverses
of the $\Rightarrow_R$ and $\dashv$ relations. The inverse of $\Rightarrow_R$ is used to represent reversed
strong rightmost derivations. The inverse of $\dashv$ introduces terminal symbols at the right
end of strings. These two inverse relations cooperate to mimic reversed rightmost derivations.

Reversed  Rightmost  Derivations 

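The reduce relation ($\models$) is the inverse of the R-derives relation, i.e.,
$\Rightarrow_R^{-1} = \models$; it is formally defined by
$\models = \{(\alpha\omega, \alpha A) \mid \alpha \in V^*,\ A \to \omega \in P\}$. The shift relation
($\vdash$) is the inverse of the chop relation, i.e., $\dashv^{-1} = \vdash$; thus,
$\vdash = \{(\alpha, \alpha a) \mid \alpha \in V^*,\ a \in T\}$. For each $a \in T$, $\vdash_a$ denotes the
subrelation of $\vdash$ with range $V^* a$. More specifically, for $\alpha, \beta \in V^*$ and $a \in T$,
$\alpha \vdash_a \beta$ if and only if $\alpha \vdash \beta$ and $\beta = \alpha a$.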

For the most part, the results in this chapter are obtained through simple manipulations
of relational expressions. Two equalities on relational expressions that are regularly
used in these transformations are recorded in the following.

Fact 4.1 Let $R$ and $S$ be binary relations on $V^*$, i.e., $R, S \subseteq V^* \times V^*$. Then the
following two statements hold: (1) $(R^*)^{-1} = (R^{-1})^*$; (2) $(R\,S)^{-1} = S^{-1} R^{-1}$. □
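
Both equalities are easy to check mechanically on small finite relations. The following sketch represents a relation as a set of pairs over a small integer domain; the helper names `compose`, `inverse`, and `star`, and the sample relations, are illustrative assumptions rather than part of the framework.

```python
def compose(r, s):
    """Relation product r s: x (r s) z iff x r y and y s z for some y."""
    return {(x, z) for (x, y1) in r for (y2, z) in s if y1 == y2}


def inverse(r):
    """Inverse relation: swap the components of every pair."""
    return {(y, x) for (x, y) in r}


def star(r, domain):
    """Reflexive-transitive closure of r over a finite domain."""
    closure = {(x, x) for x in domain} | set(r)
    while True:
        bigger = closure | compose(closure, closure)
        if bigger == closure:
            return closure
        closure = bigger


# Hypothetical sample relations on the domain {0, 1, 2, 3}.
domain = range(4)
R = {(0, 1), (1, 2)}
S = {(2, 3), (1, 3)}

fact_1 = inverse(star(R, domain)) == star(inverse(R), domain)
fact_2 = inverse(compose(R, S)) == compose(inverse(S), inverse(R))
```

Both `fact_1` and `fact_2` evaluate to true for these relations, mirroring statements (1) and (2) of Fact 4.1.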

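Some useful applications of Fact 4.1 include the following.

(1) $(\Rightarrow_R^*)^{-1} = (\Rightarrow_R^{-1})^* = \models^*$;

(2) $(\Rightarrow_R^* \dashv)^{-1} = \dashv^{-1} (\Rightarrow_R^*)^{-1} = \vdash \models^*$;

(3) $((\Rightarrow_R^* \dashv)^* \Rightarrow_R^*)^{-1} = (\Rightarrow_R^*)^{-1} ((\Rightarrow_R^* \dashv)^*)^{-1}
= \models^* ((\Rightarrow_R^* \dashv)^{-1})^* = \models^* (\vdash \models^*)^*$.

Despite the appearance of $(\vdash \models^*)$ in the last construct of both (2) and (3), the relation
product $(\models^* \vdash)$ is more appropriate to our needs. Indeed, since relation composition is
associative, the following equivalence holds:
$\models^* (\vdash \models^*)^* = (\models^* \vdash)^* \models^*$.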

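The interpretation of the relation product $(\models^* \vdash)$ is explicitly described as follows.
For $\alpha, \beta \in V^*$, $\alpha (\models^* \vdash) \beta$ holds in $G$ if and only if
$\alpha \models^* \gamma \vdash_a \gamma a = \beta$ holds in $G$ for some $\gamma \in V^*$ and $a \in T$.
This is expressed more neatly as $\alpha (\models^* \vdash)_a \beta$. The notation relevant to the
reflexive-transitive closure of this product is as follows. For all $\alpha \in V^*$,
$\alpha (\models^* \vdash)^0_\epsilon \alpha$ holds in $G$; for $\alpha, \beta, \gamma \in V^*$, $x \in T^{n-1}$ with
$n \ge 1$, and $a \in T$, if $\alpha (\models^* \vdash)^{n-1}_x \beta$ and $\beta (\models^* \vdash)_a \gamma$ hold in
$G$, then $\alpha (\models^* \vdash)^n_{xa} \gamma$ holds in $G$. If $\alpha (\models^* \vdash)^n_x \beta$ holds in
$G$ for some $\alpha, \beta \in V^*$ and $x \in T^n$, $n \ge 0$, any of the expressions
$\alpha (\models^* \vdash)^*_x \beta$, $\alpha (\models^* \vdash)^* \beta$, or $\alpha (\models^* \vdash)^n \beta$ may be
used to denote this if the string $x$ or its length $n$ is not relevant.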

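The following lemma compares relational expressions involving the $\Rightarrow_R$ and $\dashv$
relations with relational expressions involving the $\models$ and $\vdash$ relations.

Lemma 4.1 For $\alpha, \beta \in V^*$ and $x \in T^*$,
$\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \beta$ holds in $G$ if and only if
$\beta (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$.

Proof. First suppose that $x = \epsilon$. By definition, both $(\Rightarrow_R^* \dashv)^0_\epsilon$ and
$(\models^* \vdash)^0_\epsilon$ are equivalent to the identity relation on $V^*$. Thus, the following
statements are equivalent.

(1) $\alpha (\Rightarrow_R^* \dashv)^0_\epsilon \Rightarrow_R^* \beta$;

(2) $\alpha \Rightarrow_R^* \beta$;

(3) $\beta (\Rightarrow_R^*)^{-1} \alpha$;

(4) $\beta \models^* \alpha$; and

(5) $\beta (\models^* \vdash)^0_\epsilon \models^* \alpha$.

Now let $x = a_1 a_2 \cdots a_n$, $n \ge 1$. The following statements are equivalent in this case.

(1) $\alpha (\Rightarrow_R^* \dashv)^n_x \Rightarrow_R^* \beta$;

(2) $\beta ((\Rightarrow_R^* \dashv)^n_x \Rightarrow_R^*)^{-1} \alpha$;

(3) $\beta ((\Rightarrow_R^* \dashv)_{a_n} \cdots (\Rightarrow_R^* \dashv)_{a_1} \Rightarrow_R^*)^{-1} \alpha$;

(4) $\beta (\Rightarrow_R^* \dashv_{a_n} \cdots \Rightarrow_R^* \dashv_{a_1} \Rightarrow_R^*)^{-1} \alpha$;

(5) $\beta (\models^* \vdash_{a_1} \models^* \vdash_{a_2} \cdots \models^* \vdash_{a_n} \models^*) \alpha$;

(6) $\beta (\models^* \vdash)_{a_1} (\models^* \vdash)_{a_2} \cdots (\models^* \vdash)_{a_n} \models^* \alpha$; and

(7) $\beta (\models^* \vdash)^n_x \models^* \alpha$. □

The next two lemmas demonstrate how reversed rightmost derivations are represented
by the $\models$ and $\vdash$ relations.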

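Lemma 4.2 For $\alpha, \beta \in V^*$ and $x \in T^*$, if
$\alpha (\models^* \vdash)^*_x \models^* \beta$ holds in $G$, then $\beta \Rightarrow_{rm}^* \alpha x$ holds in $G$.

Proof. By Lemma 4.1, the hypothesis implies that
$\beta (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \alpha$ holds in $G$. It follows from Lemma 3.9 that
$\beta \Rightarrow_{rm}^* \alpha x$ holds in $G$. □

Lemma 4.3 For $\alpha, \beta \in V^*$, let $\alpha \Rightarrow_{rm}^* \beta$ hold in $G$. Furthermore, let
$\beta = \gamma x$ for some $\gamma \in V^*$ and $x \in T^*$ such that $\gamma \in V^* N$ if $\beta \in V^* N T^*$ and
$\gamma = \epsilon$ otherwise (i.e., $x$ is the longest suffix of $\beta$ consisting solely of terminal
symbols). Then $\gamma (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$.

Proof. The hypothesis and its conditions imply that
$\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \gamma$ holds in $G$ (see Lemma 3.12). Therefore,
$\gamma (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$ by Lemma 4.1. □

Lemma 4.4 $L(G) = \{w \in T^* \mid \epsilon (\models^* \vdash)^*_w \models^* S$ holds in $G\}$.

Proof. This is a consequence of Lemmas 4.2 and 4.3. □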
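
Lemma 4.4 suggests a direct, if naive, recognizer built from the unrestricted reduce and shift relations. The sketch below assumes an epsilon-free grammar, so that reductions never lengthen a string and every closure is finite; strings are tuples of symbols, and the names `reduce_closure` and `recognize` are illustrative assumptions.

```python
def reduce_closure(forms, productions):
    """Image of `forms` under the reflexive-transitive closure of reduce.

    A reduce step replaces a right-hand side occurring at the RIGHT END
    of a string by the corresponding left-hand side.
    """
    closure = set(forms)
    work = list(closure)
    while work:
        form = work.pop()
        for lhs, rhs in productions:
            k = len(rhs)
            # k > 0 skips epsilon-productions (assumed absent).
            if k > 0 and form[-k:] == rhs:
                new = form[:-k] + (lhs,)
                if new not in closure:
                    closure.add(new)
                    work.append(new)
    return closure


def recognize(word, productions, start):
    """Accept iff the start symbol alone is reachable after shifting all of word."""
    forms = {()}
    for a in word:
        # Reduce as far as possible, then shift the next input symbol.
        forms = {f + (a,) for f in reduce_closure(forms, productions)}
    return (start,) in reduce_closure(forms, productions)


# Hypothetical example grammar: E -> E + T | T, T -> a
EXPR = [("E", ("E", "+", "T")), ("E", ("T",)), ("T", ("a",))]
```

Here `recognize("a+a", EXPR, "E")` accepts while `recognize("a+", EXPR, "E")` rejects. Because shifting is unrestricted, the intermediate sets are never empty, so this sketch cannot detect an incorrect prefix before the input is exhausted.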

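The following connection is established between $\mathrm{PREFIX}(G)$ and the $\models$ and $\vdash$
relations.

Lemma 4.5 $\mathrm{PREFIX}(G) \subseteq \{x \in T^* \mid \epsilon (\models^* \vdash)^*_x\, \alpha$ holds in $G$ for some
$\alpha \in V^*\}$.

Proof. Let $x \in \mathrm{PREFIX}(G)$ be arbitrary. The corollaries to Theorem 3.13 together with the
assumption that $G$ is reduced yield that $\beta (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \epsilon$ holds in $G$
for some $\beta \in V^*$. By Lemma 4.1, $\epsilon (\models^* \vdash)^*_x \models^* \beta$ also holds in $G$.
Finally, this last expression implies that $\epsilon (\models^* \vdash)^*_x\, \alpha \models^* \beta$ holds in
$G$ for some $\alpha \in V^*$. □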

The set inclusion of the preceding lemma is almost invariably proper. For example,
consider the grammar with production set $P = \{S \to a\}$. Although this grammar generates
$\{a\}$, $\epsilon (\models^* \vdash)^*_{a^i}\, a^i$ holds for all $i \ge 0$. In fact, since the unrestricted
shift relation can append any terminal symbol, the set on the right-hand side is all of $T^*$;
equality holds in Lemma 4.5 only when $\mathrm{PREFIX}(G) = T^*$, as is trivially the case for
grammars which have an empty terminal alphabet.

Viable Prefixes Revisited

Lemma 4.5 suggests that the reduce and shift relations, as defined, are inadequate as a
basis for general bottom-up correct-prefix recognition. Indeed, the source of their deficiency
is revealed when they are examined under the guise of viable prefixes.

First, recall that $\mathrm{VP}(G)$ is closed with respect to $\Rightarrow_R$ and $\dashv$. Formally, a string
$\alpha \in V^*$ is a viable prefix of $G$ if and only if $\omega (\Rightarrow_R \cup \dashv)^* \alpha$ holds in $G$
for some $S \to \omega \in P$. The complementary situation that exists with respect to the
$\models$ and $\vdash$ relations is investigated in the next series of lemmas.

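Lemma 4.6 For $\alpha, \beta \in V^*$, if $\alpha \models \beta$ holds in $G$ and
$\alpha \notin \mathrm{VP}(G)$, then $\beta \notin \mathrm{VP}(G)$.

Proof. The contrapositive of this implication is proven, so we assume that
$\beta \in \mathrm{VP}(G)$. Since $\alpha \models \beta$ holds in $G$, $\beta \Rightarrow_R \alpha$ also holds. By
Lemma 3.14, this implies that $\alpha \in \mathrm{VP}(G)$. □

Corollary For $\alpha, \beta \in V^*$, if $\alpha \models \beta$ holds in $G$ and $\beta \in \mathrm{VP}(G)$,
then $\alpha \in \mathrm{VP}(G)$. □

Lemma 4.7 For $\alpha, \beta \in V^*$, if $\alpha \vdash \beta$ holds in $G$ and
$\alpha \notin \mathrm{VP}(G)$, then $\beta \notin \mathrm{VP}(G)$.

Proof. The proof is similar to that of Lemma 4.6. Lemma 3.16 is relevant in this case. □

Corollary For $\alpha, \beta \in V^*$, if $\alpha \vdash \beta$ holds in $G$ and $\beta \in \mathrm{VP}(G)$,
then $\alpha \in \mathrm{VP}(G)$. □

Lemma 4.8 For $\alpha, \beta \in V^*$, if $\alpha (\models \cup \vdash)^* \beta$ holds in $G$ and
$\alpha \notin \mathrm{VP}(G)$, then $\beta \notin \mathrm{VP}(G)$.

Proof. Since $\alpha (\models \cup \vdash)^* \beta$ holds in $G$ by assumption,
$\alpha (\models \cup \vdash)^n \beta$ holds for some $n \ge 0$. Applying Lemmas 4.6 and 4.7, this lemma
is proven by induction on $n$. □

Corollary For $\alpha, \beta \in V^*$, if $\alpha (\models \cup \vdash)^* \beta$ holds in $G$ and
$\beta \in \mathrm{VP}(G)$, then $\alpha \in \mathrm{VP}(G)$. □

By Lemma 4.8, $V^* \setminus \mathrm{VP}(G)$ is closed with respect to $(\models \cup \vdash)$. The
implication to Lemma 4.1 of this complementary closure property is addressed in the following.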

Lemma 4.9 For $\alpha, \beta \in V^*$ and $x \in T^*$, if $\alpha \in \mathrm{VP}(G)$ and
$\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \beta$ holds in $G$, then
$\beta (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$ when $\models$ and $\vdash$ are restricted to
$\mathrm{VP}(G)$.

Proof. By assumption, $\alpha$ is a viable prefix of $G$ and
$\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \beta$ holds in $G$. From Lemma 4.1,
$\beta (\models^* \vdash)^*_x \models^* \alpha$ also holds in $G$. That this latter expression holds when
$\models$ and $\vdash$ are restricted to $\mathrm{VP}(G)$ follows from Lemma 4.8 and its corollary. □

Our immediate goal is to describe general bottom-up left-to-right recognition as the
inverse of general top-down right-to-left recognition with the viable prefix being the central
unifying concept. From that standpoint, it is undesirable for the reduce and shift relations to
stray outside of $\mathrm{VP}(G)$. Consequently, these two relations are redefined to explicitly
restrict them to $\mathrm{VP}(G)$ as follows:
$\models = \{(\alpha\omega, \alpha A) \mid \alpha \in V^*,\ A \to \omega \in P,\ \alpha A \in \mathrm{VP}(G)\}$ and
$\vdash = \{(\alpha, \alpha a) \mid \alpha \in V^*,\ a \in T,\ \alpha a \in \mathrm{VP}(G)\}$. From the closure
result of Lemma 4.8, restricting the ranges of these two relations to $\mathrm{VP}(G)$ effectively
restricts their domains to $\mathrm{VP}(G)$ as well. Henceforth, these new restricted versions of
$\models$ and $\vdash$ are in effect at all times.

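Lemma 4.10 $\mathrm{VP}(G) = \{\alpha \in V^* \mid \epsilon (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$ for
some $x \in T^*\}$.

Proof. Since the $\models$ and $\vdash$ relations are restricted to $\mathrm{VP}(G)$, it is clear that any
string $\alpha \in V^*$ such that $\epsilon (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$ for some
$x \in T^*$ is a viable prefix of $G$. In order to show that every viable prefix of $G$ is similarly
produced, let $\alpha$ be an arbitrary member of $\mathrm{VP}(G)$. From Theorem 3.20,
$\omega (\Rightarrow_R^* \dashv)^*_z \Rightarrow_R^* \alpha$ holds in $G$ for some $S \to \omega \in P$ and
$z \in T^*$. Since $G$ is reduced, $\alpha (\Rightarrow_R^* \dashv)^*_x \Rightarrow_R^* \epsilon$ holds in $G$ for
some $x \in T^*$ (implying $xz \in L(G)$). It follows from Lemma 4.9 that
$\epsilon (\models^* \vdash)^*_x \models^* \alpha$ holds in $G$. □

Corollary $L(G) = \{w \in T^* \mid \epsilon (\models^* \vdash)^*_w \models^* \omega$ holds in $G$ for some
$S \to \omega \in P\}$. □

Corollary $\mathrm{PREFIX}(G) = \{x \in T^* \mid \epsilon (\models^* \vdash)^*_x\, \alpha$ holds in $G$ for some
$\alpha \in V^*\}$. □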

Finally, the following lemma motivates, ex post facto, the relation product
$(\models^* \vdash)$.

Lemma 4.11 For $\alpha \in \mathrm{VP}(G)$, at least one of the following two statements is true: (1)
$\alpha \models^* \beta \vdash \beta a$ holds in $G$ for some $\beta \in V^*$ and $a \in T$; (2)
$\alpha \models^* \omega$ holds in $G$ for some $S \to \omega \in P$.

Proof. By Theorem 3.20, $\omega (\Rightarrow_R^* \dashv)^*_z \Rightarrow_R^* \alpha$ holds in $G$ for some
$S \to \omega \in P$ and $z \in T^*$. By Lemma 4.9, $\alpha (\models^* \vdash)^*_z \models^* \omega$ also holds in
$G$. If $z = \epsilon$, then $\alpha (\models^* \vdash)^0_\epsilon \models^* \omega$ holds, which demonstrates that
statement (2) is true. Otherwise, $z = ay$ for some $a \in T$ and $y \in T^*$. In this case,
$\alpha (\models^* \vdash)_a\, \gamma (\models^* \vdash)^*_y \models^* \omega$ holds in $G$ for some
$\gamma \in V^*$. This last expression implies that $\alpha \models^* \beta \vdash \beta a = \gamma$ holds for
some $\beta \in V^*$, so statement (1) is true. □

General Bottom-Up Correct-Prefix Recognition

Now that $\models$ and $\vdash$ are defined as inverses, albeit restricted, of $\Rightarrow_R$ and
$\dashv$, respectively, the transition from General_RR to General_LR is completed by also
inverting the direction in which an input string $w \in T^*$ is scanned. Accordingly, the essence
of General_LR is that all of the reversed rightmost derivations of $w \in T^*$ are followed in
parallel.

Once again, there are theoretical limits on the precision to which this task may be
carried out; that is, it is not possible to pursue exclusively the reversed rightmost derivations
of $w$ in the general case. Instead, at the point where a prefix $x$ of $w$ has been processed, all
reversed rightmost derivations (from $\epsilon$) of all strings in $xT^* \cap L(G)$ are followed (i.e.,
all sentences that have $x$ as a prefix).

As in the top-down recognition schemes, regularity-preserving operations on regular
subsets of $\mathrm{VP}(G)$ are the key to General_LR. Correct-prefix recognition is performed, i.e.,
the membership in $\mathrm{PREFIX}(G)$ of an incrementally longer prefix of $w$ is ascertained as $w$
is scanned from left to right. Given a prefix $x$ of $w$, the inclusion of $x$ in $\mathrm{PREFIX}(G)$ is
determined from the set $\{\alpha \in \mathrm{VP}(G) \mid \alpha \Rightarrow_{rm}^* x$ holds in $G\}$. This set is
nonempty if and only if $x \in \mathrm{PREFIX}(G)$, and it contains $\omega$ for some $S \to \omega \in P$ if
and only if $x \in L(G)$. Figure 4.1 presents a high-level description of General_LR; a more
detailed discussion follows.


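function General_LR ($G = (V, T, P, S)$; $w \in T^*$)
    // $w = a_1 a_2 \cdots a_n$, $n \ge 0$, each $a_i \in T$
    $\mathrm{PVP}_{LR}(G, \varepsilon) := \{\varepsilon\}$
    for $i := 0$ to $n-1$ do
        $\mathrm{VP}_{LR}(G, i{:}w) := \models^{*}(\mathrm{PVP}_{LR}(G, i{:}w))$
        $\mathrm{PVP}_{LR}(G, (i{+}1){:}w) := \dashv_{a_{i+1}}(\mathrm{VP}_{LR}(G, i{:}w))$
        if $\mathrm{PVP}_{LR}(G, (i{+}1){:}w) = \emptyset$ then Reject($w$) fi
    od
    $\mathrm{VP}_{LR}(G, w) := \models^{*}(\mathrm{PVP}_{LR}(G, w))$
    if $\omega \in \mathrm{VP}_{LR}(G, w)$ for some $S \to \omega \in P$ then Accept($w$) else Reject($w$) fi


Figure 4.1 — A General Bottom-Up Correct-Prefix Recognizer

Let $x \in T^*$ be an arbitrary string. The primitive LR-associates of $x$ (in $G$) are defined
by $\mathrm{PVP}_{LR}(G, x) = \{\alpha \in \mathrm{VP}(G) \mid \varepsilon\,(\models^{*}\dashv)^{x}\,\alpha$ holds in $G\}$. Clearly, $\mathrm{PVP}_{LR}(G, \varepsilon) = \{\varepsilon\}$. The
LR-associates of $x$ (in $G$) are defined by $\mathrm{VP}_{LR}(G, x) = \{\alpha \in \mathrm{VP}(G) \mid \varepsilon\,(\models^{*}\dashv)^{x}\models^{*}\alpha$ holds in
$G\}$. By Lemma 4.2, this set is equivalent to $\{\alpha \in \mathrm{VP}(G) \mid \alpha \Rightarrow_r^{*} x$ holds in $G\}$.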

An input string $w \in T^*$ is recognized by General_LR through the computation of
$\mathrm{PVP}_{LR}(G, i{:}w)$ and $\mathrm{VP}_{LR}(G, i{:}w)$ as $i$ ranges from 0 to len($w$). The process terminates when
either an empty set is produced or the input string is exhausted. Analogous to the top-down
recognition schemes, the relationships between $\mathrm{VP}_{LR}(G, x)$ and $\mathrm{PVP}_{LR}(G, x)$, and between
$\mathrm{PVP}_{LR}(G, xa)$ and $\mathrm{VP}_{LR}(G, x)$ are significant. Specifically, for $x \in T^*$ and $a \in T$,
$\mathrm{VP}_{LR}(G, x) = \{\beta \in \mathrm{VP}(G) \mid \alpha \models^{*} \beta$ holds in $G$ for some $\alpha \in \mathrm{PVP}_{LR}(G, x)\} = \models^{*}(\mathrm{PVP}_{LR}(G, x))$ and
$\mathrm{PVP}_{LR}(G, xa) = \{\beta \in \mathrm{VP}(G) \mid \alpha \dashv_a \beta$ holds in $G$ for some $\alpha \in \mathrm{VP}_{LR}(G, x)\} = \dashv_a(\mathrm{VP}_{LR}(G, x))$.
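The two set equations above can be exercised directly on a small grammar. The following is a minimal sketch, not the automata-based recognizer developed later: it stores each set explicitly as a finite Python set, which only works because the toy grammar chosen here (a hypothetical example, not taken from the text) is free of epsilon-productions and cycles. The names `PRODS`, `is_viable_prefix`, `reduce_star`, `shift`, and `general_lr` are illustrative assumptions; membership in VP(G) is tested with LR(0) item sets, which is adequate for a reduced grammar.

```python
# Hypothetical toy grammar (not from the text): S -> E, E -> E + a | a.
# It has no epsilon-productions and no cycles, so every set below is finite.
PRODS = {"S": [("E",)], "E": [("E", "+", "a"), ("a",)]}
START = "S"

def closure(items):
    """LR(0) item closure; an item is (lhs, rhs, dot)."""
    items, work = set(items), list(items)
    while work:
        _, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in PRODS:
            for r in PRODS[rhs[dot]]:
                item = (rhs[dot], r, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return items

def is_viable_prefix(alpha):
    """Assumed test for VP(G): the LR(0) item sets can read all of alpha."""
    state = closure({(START, r, 0) for r in PRODS[START]})
    for sym in alpha:
        state = closure({(l, r, d + 1) for (l, r, d) in state
                         if d < len(r) and r[d] == sym})
        if not state:
            return False
    return True

def reduce_star(vps):
    """Image of a finite set of viable prefixes under the reflexive-transitive
    closure of the reduction relation (each step stays inside VP(G))."""
    result, work = set(vps), list(vps)
    while work:
        alpha = work.pop()
        for lhs, rhss in PRODS.items():
            for rhs in rhss:
                cut = len(alpha) - len(rhs)
                if cut >= 0 and alpha[cut:] == rhs:
                    beta = alpha[:cut] + (lhs,)
                    if beta not in result and is_viable_prefix(beta):
                        result.add(beta)
                        work.append(beta)
    return result

def shift(vps, a):
    """Image under the shift relation on terminal a: append a, keep VP(G) members."""
    return {alpha + (a,) for alpha in vps if is_viable_prefix(alpha + (a,))}

def general_lr(w):
    """Follows the loop of Figure 4.1 with explicit finite sets."""
    pvp = {()}                      # primitive LR-associates of the empty prefix
    for a in w:
        vp = reduce_star(pvp)       # LR-associates of the prefix read so far
        pvp = shift(vp, a)
        if not pvp:
            return False            # the prefix read so far is not in PREFIX(G)
    vp = reduce_star(pvp)
    return any(rhs in vp for rhs in PRODS[START])
```

For example, `general_lr(tuple("a+a"))` accepts, while `general_lr(tuple("a+"))` runs to the end of the input and rejects only in the final test, which corresponds to an input that is a correct prefix but not a sentence.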

The conditions for termination are analogous to those for General_RR and General_LL.
Given an input string $w \in T^*$, first suppose that $w \in L(G)$. In this case, $\mathrm{VP}_{LR}(G, w)$ is the
last set of LR-associates computed by General_LR; after it is completed, $w$ is accepted based
on the fact that $\omega \in \mathrm{VP}_{LR}(G, w)$ for some $S \to \omega \in P$ if and only if $w \in L(G)$. Alternatively,
suppose that $w \notin L(G)$. If $w \notin$ PREFIX($G$) also holds, there is a unique string $x \in T^*$ which
is the shortest prefix of $w$ such that $x \notin$ PREFIX($G$) holds. In this case, $\mathrm{PVP}_{LR}(G, x)$ is the
first empty set computed by the recognizer. On the other hand, suppose that $w \notin L(G)$ and
$w \in$ PREFIX($G$) both hold. In this case, it is discovered that $\omega \notin \mathrm{VP}_{LR}(G, w)$ for any
$S \to \omega \in P$. In either case, the input string is rejected by General_LR.

The correctness of General_LR is recorded more formally in the next two lemmas.

Lemma 4.12 Let $w \in L(G)$ be arbitrary. If General_LR is applied to $G$ and $w$, then
General_LR accepts $w$.

Proof. From earlier results, $\mathrm{PVP}_{LR}(G, i{:}w)$ and $\mathrm{VP}_{LR}(G, i{:}w)$, $0 \le i \le \mathrm{len}(w)$, are all
nonempty. Thus, the for loop of General_LR completes len($w$) iterations. Since $w \in L(G)$
by assumption, $\omega \in \mathrm{VP}_{LR}(G, w)$ for some $S \to \omega \in P$. Therefore, $w$ is accepted by
General_LR in the second if statement. □

Lemma 4.13 Let $w \notin L(G)$ be arbitrary. If General_LR is applied to $G$ and $w$, then
General_LR rejects $w$.

Proof. There are two cases to consider according to whether or not $w$ is in PREFIX($G$).

Case (i): $w \in$ PREFIX($G$). In this case, $\mathrm{PVP}_{LR}(G, i{:}w)$ and $\mathrm{VP}_{LR}(G, i{:}w)$ are nonempty for
all $i$, $0 \le i \le \mathrm{len}(w)$, so the for loop of General_LR completes len($w$) iterations. Since
$w \notin L(G)$ by assumption, $\omega \notin \mathrm{VP}_{LR}(G, w)$ for any $S \to \omega \in P$. Therefore, $w$ is rejected by
General_LR in the if statement that follows the for loop.

Case (ii): $w \notin$ PREFIX($G$). Let $x \in T^*$ be the unique string which is the longest prefix of $w$
such that $x \in$ PREFIX($G$) holds. Let len($x$) $= m$ and note that $0 \le m < \mathrm{len}(w)$. For all $i$,
$0 \le i \le m$, $\mathrm{PVP}_{LR}(G, i{:}x)$ and $\mathrm{VP}_{LR}(G, i{:}x)$ are nonempty, so the for loop of General_LR
completes $m$ iterations. During the $(m+1)$st iteration, $\mathrm{PVP}_{LR}(G, (m{+}1){:}w) = \emptyset$ is computed.
Therefore, $w$ is rejected by General_LR in the if statement enclosed within the for loop. □

Regularity Properties

The regularity properties inherent to all context-free grammars that are exploited by
General_LR are identified in this section. Specifically, for an arbitrary string $x \in T^*$,
$\mathrm{PVP}_{LR}(G, x)$ and $\mathrm{VP}_{LR}(G, x)$ are regular languages.

Lemma 4.14 Relation $\models^{*}$ is regularity-preserving.

Proof. Let $G = (V, T, P, S)$ be an arbitrary grammar and let $L$ be an arbitrary regular subset
of VP($G$). Define the regular canonical system $C = (V, \Pi)$ such that $\Pi =
\{(\xi\omega, \xi A) \mid A \to \omega \in P\}$. Since $\Rightarrow_C$ is defined on $V^*$ and $\models$ is defined on VP($G$) $\subseteq V^*$, $\models$ is a
subrelation of $\Rightarrow_C$. By Fact 3.1, $L' = \tau(L, C, \{\varepsilon\})$ is a regular language. Since regular
languages are closed under intersection, $L' \cap \mathrm{VP}(G)$ is regular. Clearly, $\models^{*}(L) \subseteq L' \cap \mathrm{VP}(G)$
holds, since $\models$ is a subrelation of $\Rightarrow_C$ that is restricted to VP($G$). The converse inclusion,
viz., $L' \cap \mathrm{VP}(G) \subseteq \models^{*}(L)$, is obtained by applying the corollary to Lemma 4.6. Specifically,
for $\alpha \in L$ and $\beta \in L' \cap \mathrm{VP}(G)$, if $\alpha \Rightarrow_C^{*} \beta$ holds in $C$, then $\alpha \models^{*} \beta$ holds in $G$. Thus, $\models^{*}(L) =
L' \cap \mathrm{VP}(G)$, so $\models^{*}$ is regularity-preserving. □

Lemma 4.15 Relation $\dashv$ is regularity-preserving.

Proof. Let $G = (V, T, P, S)$ be a grammar, $a$ a terminal symbol in $T$, and $L$ an arbitrary
regular subset of VP($G$). Since regular languages are closed under concatenation, $La$ is a
regular language. However, $La$ may contain some strings which are not viable prefixes of $G$.
This is rectified by intersecting $La$ with VP($G$). Since regular languages are also closed
under intersection, $La \cap \mathrm{VP}(G)$ is regular. Clearly, $\alpha a \in V^*$ is contained in $La \cap \mathrm{VP}(G)$ if
and only if $\alpha \in L$ and $\alpha a \in \mathrm{VP}(G)$ (i.e., $\alpha \dashv_a \alpha a$ holds in $G$). Thus, $\dashv_a(L) = La \cap \mathrm{VP}(G)$, so
$\dashv$ is regularity-preserving. □

Theorem 4.16 Let $G = (V, T, P, S)$ be an arbitrary grammar and let $x$ be an arbitrary
string over $T$. Then $\mathrm{PVP}_{LR}(G, x)$ and $\mathrm{VP}_{LR}(G, x)$ are regular languages.

Proof. Applying Lemmas 4.14 and 4.15 and noting that $\mathrm{PVP}_{LR}(G, \varepsilon) = \{\varepsilon\}$ is regular, the
theorem is proven by induction on len($x$). □
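The induction behind Theorem 4.16 can be written out as a short chain of equalities. The block below only restates the recurrences of General_LR together with the closed form of the shift step obtained in the proof of Lemma 4.15, for $x \in T^*$ and $a \in T$; it is a summary sketch rather than an additional result.

$$
\begin{aligned}
\mathrm{PVP}_{LR}(G, \varepsilon) &= \{\varepsilon\} \\
\mathrm{VP}_{LR}(G, x) &= \models^{*}\bigl(\mathrm{PVP}_{LR}(G, x)\bigr) \\
\mathrm{PVP}_{LR}(G, xa) &= \dashv_a\bigl(\mathrm{VP}_{LR}(G, x)\bigr) = \mathrm{VP}_{LR}(G, x)\,a \,\cap\, \mathrm{VP}(G)
\end{aligned}
$$

The first set is finite and hence regular, and each later set is the image of a regular set under a regularity-preserving relation, which is all the inductive step requires.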

Discussion 

A simple description of general left-to-right bottom-up recognition was presented. The
General_LR recognition scheme was derived from General_RR by defining the inverses of
$\Rightarrow_r$ and $\vdash$, restricting them to VP($G$), reversing the direction in which the input string is
scanned, and manipulating some relational expressions. The two inverse relations, $\models$ and
$\dashv$, preserve regularity. Thus, the essence of general left-to-right bottom-up recognition was
captured in terms of computing the images of regular subsets of VP($G$) under these relations.

Together, the results in Chapters III and IV provide a succinct and elegant
characterization of general context-free recognition. This was accomplished by starting from two
binary relations on strings and applying basic set-theoretic concepts. There was no need to
resort to automata, although automata are certainly useful for implementing the abstract
recognizers. In short, the formal development contained in these two chapters provides a
framework, founded on a minimal number of kernel concepts, within which the intrinsic
properties of general canonical context-free recognizers may be further investigated.

The denotations “RR”, “LL”, and “LR” that pervade Chapters III and IV were
suggested by Knuth [28], where the following deterministic context-free grammar classes and the
methods of their analysis are enumerated:

RR($k$): scan from right to left, deduce rightmost derivations;

LL($k$): scan from left to right, deduce leftmost derivations;

LR($k$): scan from left to right, deduce reversed rightmost derivations; and

RL($k$): scan from right to left, deduce reversed leftmost derivations.

Here, $k \ge 0$ indicates the length of lookahead strings used. Note that the use of these
denotations is meant to evince a generalization of the respective parsing methods rather than a
generalization of the grammatical classes. A corresponding General_RL recognition scheme is
not included here. To mesh with the other recognition schemes, it would utilize the $\models$ and
$\dashv$ relations defined in terms of $G^R$. Images of regular subsets of VS($G$) under these relations
would be tracked by General_RL as an input string is scanned from right to left.

The General_RR recognition scheme was developed primarily as a stepping stone to
General_LL and General_LR. General_RR is given little attention in the remaining
chapters. Consequently, $\mathrm{VP}_{LR}(G, x)$ (resp. $\mathrm{PVP}_{LR}(G, x)$) is simplified to VP($G, x$) (resp.
PVP($G, x$)). Similarly, VS($G, x$) (resp. PVS($G, x$)) is used to denote $\mathrm{VS}_{LL}(G, x)$ (resp.
$\mathrm{PVS}_{LL}(G, x)$).


CHAPTER  V 

ON  EARLEY’S  ALGORITHM 


In this chapter, Earley’s general context-free recognizer is examined and its relationship
to the General_LR and General_LL recognition schemes is ascertained. In particular, a
modified version of Earley’s recognizer is presented which builds a state-transition graph (STG) in
addition to the state sets that are constructed by Earley’s original algorithm. Analyses of
certain properties of the resulting STG reveal parallels between Earley’s algorithm and the
General_LR and General_LL recognizers. Throughout this chapter, an arbitrary reduced
\$-augmented grammar $G = (V, T, P, S)$ and an arbitrary string $w = a_1 a_2 \cdots a_{n+1}$, $n \ge 0$,
$a_i \in T \setminus \{\$\}$ for $1 \le i \le n$, $a_{n+1} = \$$, are assumed.

Earley’s General Recognizer

Recall that $A \to \alpha \cdot \beta$ is an item of $G$ whenever there is a production of the form
$A \to \alpha\beta$ in $P$. The bracketed pair $[A \to \alpha \cdot \beta, j]$, where $A \to \alpha \cdot \beta$ is an item of $G$ and $j$ is a
natural number, is called an Earley state of $G$ (or state, for short). Earley’s algorithm, in
recognizing $w$ with respect to $G$, constructs a sequence of sets of Earley states $S_i$,
$0 \le i \le n+1$. The sets are constructed in order of increasing $i$ beginning with $S_0$. Thus, set
$S_i$ is constructed only after all sets $S_j$ with $0 \le j < i$ are in place.

Each $S_i$ is initialized to a finite set of states which we denote by basis($S_i$). For
$1 \le i \le n+1$, basis($S_i$) is determined from $S_{i-1}$ and $a_i$; $S_0$ is a special case.

$$\mathrm{basis}(S_i) = \begin{cases} \{[S' \to \cdot S\$, 0]\} & \text{if } i = 0 \\ \{[A \to \alpha a_i \cdot \beta, j] \mid [A \to \alpha \cdot a_i \beta, j] \in S_{i-1}\} & \text{if } 1 \le i \le n+1 \end{cases}$$

The lone state in basis($S_0$), $[S' \to \cdot S\$, 0]$, is called the initial state; it will be denoted by $s_0$.
For $i > 0$, basis($S_i$) is constructed by the Earley Scanner function.

A state-set closure function, informally called S_Closure, completes the construction of
a set of Earley states. That is, for $0 \le i \le n$, S_Closure is applied to basis($S_i$) to produce $S_i$.
Since $S_{n+1} = \mathrm{basis}(S_{n+1})$, there is no need to apply S_Closure to basis($S_{n+1}$).

$$S_i = \begin{cases} \text{S\_Closure}(\mathrm{basis}(S_i)) & \text{if } 0 \le i \le n \\ \mathrm{basis}(S_i) & \text{if } i = n+1 \end{cases}$$

For $0 \le i \le n$, S_Closure(basis($S_i$)) computes $S_i$ as the smallest set of Earley states which
satisfies the following three rules.

(1) Every state in basis($S_i$) is in $S_i$.

(2) If $[A \to \alpha \cdot B\beta, j]$ is in $S_i$, then for all $B \to \omega \in P$, $[B \to \cdot\,\omega, i]$ is in $S_i$.

(3) If $[B \to \omega \cdot, j]$ is in $S_i$, then for all $[A \to \alpha \cdot B\beta, k]$ in $S_j$, $[A \to \alpha B \cdot \beta, k]$ is in $S_i$.

The states added to $S_i$ by rules (2) and (3) above correspond to the states that are spawned
by the Earley Predictor and Completer functions, respectively. Thus, S_Closure embodies
both of these functions. The number of states added to $S_i$ during its closure is finite; after
all possible states are added, we say that $S_i$ is closed.

Figure 5.1 presents Earley’s general context-free recognizer in terms of the notation
defined above. A Scanner function is assumed which computes basis($S_{i+1}$) from $S_i$ and $a_{i+1}$,
$0 \le i \le n$.


function Earley ($G = (V, T, P, S)$; $w \in T^*$)
    // $w = a_1 a_2 \cdots a_{n+1}$, $n \ge 0$, $a_i \in T \setminus \{\$\}$, $1 \le i \le n$, $a_{n+1} = \$$
    basis($S_0$) := $\{[S' \to \cdot S\$, 0]\}$
    for $i := 0$ to $n$ do
        $S_i$ := S_Closure(basis($S_i$))
        basis($S_{i+1}$) := Scanner($S_i$, $a_{i+1}$)
        if basis($S_{i+1}$) $= \emptyset$ then Reject($w$) fi
    od
    $S_{n+1}$ := basis($S_{n+1}$)
    Accept($w$)


Figure 5.1 — Earley’s General Recognizer

For $0 \le i \le n+1$, if $[A \to \alpha \cdot \beta, j] \in S_i$, then $0 \le j \le i$. Due to this property, the second
component of an Earley state is called a back-pointer. Thus, note that in rule (2) of the
S_Closure function, $0 \le j \le i$ holds, and in rule (3), $0 \le k \le j \le i$ holds.

For $0 \le i \le n+1$, $i{:}w \in$ PREFIX($G$) if and only if $S_i \ne \emptyset$. This is just a formal way to
state that Earley’s algorithm is a correct-prefix recognizer. Moreover, $w \in L(G)$ if and only if
$S_{n+1} = \{[S' \to S\$ \cdot, 0]\}$. Conversely, $w \notin L(G)$ if and only if $\exists i$, $0 \le i \le n$, such that
$\forall j$, $0 \le j \le i$, $S_j \ne \emptyset$ and $\forall j$, $i < j \le n+1$, $S_j = \emptyset$. In this case, $i{:}w$ is the longest prefix of $w$
such that $i{:}w \in$ PREFIX($G$) holds.
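The recognizer of Figure 5.1 and the three closure rules can be sketched in a few lines. The code below is an illustrative sketch under stated assumptions, not Earley’s original formulation: the \$-augmented toy grammar, the tuple encoding of a state as `(lhs, rhs, dot, back_pointer)`, and the names `PRODS`, `s_closure`, `scanner`, and `earley` are all hypothetical, and the closure is computed by naive iteration to a fixed point rather than with a work list.

```python
# Hypothetical $-augmented toy grammar (not from the text):
#   S' -> S $,   S -> S + a | a
PRODS = {"S'": [("S", "$")], "S": [("S", "+", "a"), ("a",)]}

def s_closure(basis, sets, i):
    """Smallest superset of basis(S_i) closed under rules (2) and (3).
    A state is the tuple (lhs, rhs, dot, back_pointer)."""
    S = set(basis)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot, j in list(S):
            if dot < len(rhs) and rhs[dot] in PRODS:       # rule (2), Predictor
                new = {(rhs[dot], r, 0, i) for r in PRODS[rhs[dot]]}
            elif dot == len(rhs):                           # rule (3), Completer
                source = S if j == i else sets[j]           # S_i is still being built
                new = {(l, r, d + 1, k) for (l, r, d, k) in source
                       if d < len(r) and r[d] == lhs}
            else:                                           # dot before a terminal
                new = set()
            if not new <= S:
                S |= new
                changed = True
    return S

def scanner(S, a):
    """basis(S_{i+1}): advance the dot over terminal a in every state of S_i."""
    return {(l, r, d + 1, j) for (l, r, d, j) in S if d < len(r) and r[d] == a}

def earley(tokens):
    """Returns True for Accept and False for Reject, following Figure 5.1."""
    w = list(tokens) + ["$"]            # a_1 ... a_n followed by a_{n+1} = $
    sets = []
    basis = {("S'", ("S", "$"), 0, 0)}  # basis(S_0) holds the initial state
    for i in range(len(w)):             # i = 0 .. n
        sets.append(s_closure(basis, sets, i))
        basis = scanner(sets[i], w[i])  # w[i] plays the role of a_{i+1}
        if not basis:
            return False
    sets.append(basis)                  # S_{n+1} = basis(S_{n+1})
    return True
```

With this grammar, `earley("a+a")` accepts, whereas `earley("a+")` is rejected only when the end marker is scanned, since `a+` is still a correct prefix.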

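The correctness of Earley’s algorithm is based on the criterion that places a state in a
particular state set [6]. In that regard, the following statements are made.

Fact 5.1 For $0 \le j \le i \le n+1$ and $A \to \alpha\beta \in P$, $[A \to \alpha \cdot \beta, j] \in S_i$ if and only if
$\alpha \Rightarrow^{*} a_{j+1} a_{j+2} \cdots a_i$ and $S' \Rightarrow^{*} \delta A \gamma$ hold in $G$ for some $\delta, \gamma \in V^*$ such that $\delta \Rightarrow^{*} a_1 a_2 \cdots a_j$
holds in $G$. □

Facts 5.2 and 5.3 below ascribe bottom-up and top-down interpretations, respectively,
to Fact 5.1.

Fact 5.2 For $0 \le j \le i \le n+1$ and $A \to \alpha\beta \in P$, $[A \to \alpha \cdot \beta, j] \in S_i$ if and only if
$\alpha \Rightarrow_r^{*} a_{j+1} a_{j+2} \cdots a_i$ and $S' \Rightarrow_r^{*} \delta A y$ hold in $G$ for some $\delta \in V^*$ and $y \in T^*$ such that
$\delta \Rightarrow_r^{*} a_1 a_2 \cdots a_j$ holds in $G$. □

Note that $\delta \in \mathrm{VP}(G, j{:}w)$ and $\delta\alpha \in \mathrm{VP}(G, i{:}w)$. We say that $[A \to \alpha \cdot \beta, j] \in S_i$ is valid
for $\delta\alpha \in \mathrm{VP}(G, i{:}w)$; in particular, $[A \to \cdot\,\alpha\beta, j] \in S_j$ is valid for $\delta \in \mathrm{VP}(G, j{:}w)$. If $\alpha \ne \varepsilon$ also
holds, then we say that $[A \to \alpha \cdot \beta, j] \in S_i$ properly cuts $\delta\alpha \in \mathrm{VP}(G, i{:}w)$.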

Fact 5.3 For $0 \le j \le i \le n+1$ and $A \to \alpha\beta \in P$, $[A \to \alpha \cdot \beta, j] \in S_i$ if and only if
$\alpha \Rightarrow^{*} a_{j+1} a_{j+2} \cdots a_i$ and $S' \Rightarrow^{*} a_1 a_2 \cdots a_j A \delta$ hold in $G$ for some $\delta \in V^*$. □

In this case, note that $(A\delta)^R \in \mathrm{VS}(G, j{:}w)$ and $(\beta\delta)^R \in \mathrm{VS}(G, i{:}w)$. We say that
$[A \to \alpha \cdot \beta, j] \in S_i$ is valid for $(\beta\delta)^R \in \mathrm{VS}(G, i{:}w)$; in particular, $[A \to \cdot\,\alpha\beta, j] \in S_j$ is valid for
$(A\delta)^R \in \mathrm{VS}(G, j{:}w)$.

A Modified Earley Recognizer

A modified version of Earley’s recognizer, called Earley', is described next. Earley'
differs from Earley’s algorithm in that it constructs a state-transition graph. The STG
constructed by Earley' is denoted by $G_{E'} = (Q_{E'}, V, \delta_{E'})$. The states in $Q_{E'}$ are the Earley states
that are generated by Earley’s algorithm. The state transitions in $\delta_{E'}$ are described below.

In recognizing $w$ with respect to $G$, Earley' builds the same sequence of state sets as
Earley’s algorithm. In addition, a sequence of state-transition sets, viz., $E_i$ for $0 \le i \le n+1$,
is constructed. These sets are also constructed in order of increasing $i$. In particular, $E_i$ is
constructed concurrently with $S_i$. For $0 \le i \le n+1$, every transition in $E_i$ is of the form
$(s, X, t)$ where $s \in S_j$ for some $j$, $0 \le j \le i$, $X \in V \cup \{\varepsilon\}$, and $t \in S_i$.

A particular set of state transitions $E_i$ is constructed analogously to $S_i$. That is, (1) $E_i$
is initialized to a finite set of transitions denoted by basis($E_i$), and (2) a transition-set closure
function, called E_Closure, is applied to basis($E_i$) to complete the construction of $E_i$. For
$0 \le i \le n+1$, basis($E_i$) is defined as follows.

$$\mathrm{basis}(E_i) = \begin{cases} \emptyset & \text{if } i = 0 \\ \{(s, a_i, t) \mid s = [A \to \alpha \cdot a_i \beta, j] \in S_{i-1},\ t = [A \to \alpha a_i \cdot \beta, j] \in S_i\} & \text{if } 1 \le i \le n+1 \end{cases}$$

Note that basis($E_i$) where $i > 0$ is determined from $S_{i-1}$ and $a_i$; basis($E_0$) is a special
case. For $i > 0$, the transitions in basis($E_i$) may be installed by a slightly modified Earley
Scanner function.

For $0 \le i \le n$, the set closure function E_Closure is applied to basis($E_i$) to complete the
construction of $E_i$. Similar to $S_{n+1}$, $E_{n+1} = \mathrm{basis}(E_{n+1})$.

$$E_i = \begin{cases} \text{E\_Closure}(\mathrm{basis}(E_i)) & \text{if } 0 \le i \le n \\ \mathrm{basis}(E_i) & \text{if } i = n+1 \end{cases}$$

For $0 \le i \le n$, the set computed by E_Closure(basis($E_i$)) is the smallest set of transitions
which satisfies the following three rules.

(1) Every transition in basis($E_i$) is in $E_i$.

(2) If $s = [A \to \alpha \cdot B\beta, j]$ is in $S_i$, then for all $B \to \omega \in P$, $(s, \varepsilon, t)$ is in $E_i$ where
$t = [B \to \cdot\,\omega, i] \in S_i$.

(3) If $[B \to \omega \cdot, j]$ is in $S_i$, then for all $s = [A \to \alpha \cdot B\beta, k]$ in $S_j$, $(s, B, t)$ is in $E_i$ where
$t = [A \to \alpha B \cdot \beta, k] \in S_i$.

Transitions added to $E_i$ by rules (2) and (3) above correlate closely with the states that are
generated by the Predictor and Completer functions, respectively.

A high-level description of Earley' is given in Figure 5.2. In that figure, we assume (1) a
generalized Closure function which concurrently constructs $S_i$ and $E_i$, $0 \le i \le n$, after they
are initialized to basis($S_i$) and basis($E_i$), respectively, and (2) a modified Scanner function
which computes basis($S_{i+1}$) and basis($E_{i+1}$), $0 \le i \le n$. The correctness of Earley' follows
from the well-established correctness of Earley’s original algorithm.


function Earley' ($G = (V, T, P, S)$; $w \in T^*$)
    // $w = a_1 a_2 \cdots a_{n+1}$, $n \ge 0$, $a_i \in T \setminus \{\$\}$, $1 \le i \le n$, $a_{n+1} = \$$
    basis($S_0$), basis($E_0$) := $\{[S' \to \cdot S\$, 0]\}$, $\emptyset$
    for $i := 0$ to $n$ do
        ($S_i$, $E_i$) := Closure(basis($S_i$), basis($E_i$))
        (basis($S_{i+1}$), basis($E_{i+1}$)) := Scanner($S_i$, $a_{i+1}$)
        if basis($S_{i+1}$) $= \emptyset$ then Reject($w$) fi
    od
    $S_{n+1}$, $E_{n+1}$ := basis($S_{n+1}$), basis($E_{n+1}$)
    Accept($w$)
end


Figure 5.2 — A Modified Earley Recognizer


The STG $G_{E'}$ is informally called the Earley state graph. When the Earley state graph
is complete, $G_{E'} = (Q_{E'}, V, \delta_{E'})$ where $Q_{E'} = \bigcup_{0 \le i \le n+1} S_i$ and $\delta_{E'} = \bigcup_{0 \le i \le n+1} E_i$.

As every state in $G_{E'}$ is reachable from the initial state $[S' \to \cdot S\$, 0]$, $s_0$ is also called
the root of $G_{E'}$. A path in $G_{E'}$ which begins at the root is called a rooted path in $G_{E'}$.

Earley’s Algorithm and Viable Prefixes

Let $G_{E'} = (Q_{E'}, V, \delta_{E'})$ be the Earley state graph that results from applying Earley' to
$G$ and $w$. In this section, the strings over $V$ that are spelled by rooted paths in $G_{E'}$ are
analyzed. It transpires that the string spelled by an arbitrary rooted path in $G_{E'}$ is a viable
prefix of $G$. Moreover, the string spelled by a rooted path in $G_{E'}$ which terminates at a state
in $S_i$, $0 \le i \le n+1$, is a member of VP($G, i{:}w$). The analysis that follows presumes a
bottom-up interpretation of Earley’s algorithm as exemplified by Fact 5.2.

Lemma 5.1 For $0 \le j \le i \le n+1$ and $A \to \alpha\beta \in P$, let $s = [A \to \alpha \cdot \beta, j]$ be an Earley
state in $S_i$. Then every path of length len($\alpha$) to $s$ in $G_{E'}$ spells $\alpha$.

Proof. The proof is by induction on len($\alpha$) $= m$.

Basis ($m = 0$). In this case $\alpha = \varepsilon$, so $j = i$ and $s = [A \to \cdot\,\beta, j]$. The unique path of length 0 to
$s$ in $G_{E'}$ is denoted by $(s)$. By definition, this trivial path spells $\varepsilon$.

Induction ($m > 0$). Since len($\alpha$) $> 0$, $\alpha = \alpha' X$ for some $\alpha' \in V^*$ and $X \in V$, i.e.,
$s = [A \to \alpha' X \cdot \beta, j]$. Thus, $s$ was added to $S_i$ by either the Scanner or the Completer. In
either case, every transition to $s$ in $G_{E'}$ is of the form $(r, X, s)$ such that
$r = [A \to \alpha' \cdot X\beta, j] \in S_{i'}$ for some $i'$, $j \le i' \le i$. Choose one such $r$. By the induction
hypothesis, every path of length len($\alpha'$) to $r$ in $G_{E'}$ spells $\alpha'$. Consequently, every path of
length len($\alpha$) to $s$ in $G_{E'}$ spells $\alpha' X = \alpha$. □

Corollary For $0 \le j \le i \le n+1$ and $A \to \alpha\beta \in P$, let $s = [A \to \alpha \cdot \beta, j]$ be an Earley
state in $S_i$. Then every path of length len($\alpha$) to $s$ in $G_{E'}$ begins at $[A \to \cdot\,\alpha\beta, j] \in S_j$. □

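Lemma 5.2 Let $p = (s_0, s_1, \ldots, s_m)$, $m \ge 0$, be a rooted path in $G_{E'}$ such that $p$ spells
$\gamma \in V^*$ and $s_m = [A \to \alpha \cdot \beta, j] \in S_i$ for some $A \to \alpha\beta \in P$ and $i, j$, $0 \le j \le i \le n+1$. Then
$\gamma \in \mathrm{VP}(G, i{:}w)$ and $[A \to \alpha \cdot \beta, j]$ is valid for $\gamma$.

Proof. By Lemma 5.1 and its Corollary, $\gamma = \delta\alpha$ for some $\delta \in V^*$; we show by induction on $m$
that $S' \Rightarrow_r^{*} \delta A y$ and $\delta \Rightarrow_r^{*} a_1 a_2 \cdots a_j$ hold in $G$ for some $y \in T^*$.

Basis ($m = 0$). Thus, $i = 0$, $s_m = s_0 = [S' \to \cdot S\$, 0] \in S_0$, and $\gamma = \varepsilon$. The consequent trivially
holds in this case.

Induction ($m > 0$). Two cases are analyzed based on whether or not $\alpha = \varepsilon$.

Case (i): $\alpha = \varepsilon$. In this case, $j = i$ and $s_m$ was added to $S_i$ by the Predictor. Thus, $s_{m-1} =
[B \to \sigma \cdot A\tau, j'] \in S_i$ for some $B \to \sigma A \tau \in P$ and $j'$, $0 \le j' \le i$. Let $p' = (s_0, s_1, \ldots, s_{m-1})$.
Clearly $p'$ also spells $\gamma$. By Fact 5.2, $\sigma \Rightarrow_r^{*} a_{j'+1} a_{j'+2} \cdots a_i$ holds in $G$. By the induction
hypothesis, $\gamma = \delta\sigma$ for some $\delta \in V^*$ such that $S' \Rightarrow_r^{*} \delta B y$ and $\delta \Rightarrow_r^{*} a_1 a_2 \cdots a_{j'}$ hold in $G$ for
some $y \in T^*$. That is, $\gamma \in \mathrm{VP}(G, i{:}w)$ and $[B \to \sigma \cdot A\tau, j'] \in S_i$ is valid for $\gamma$. Since
$\delta B y \Rightarrow_r^{*} \delta\sigma A x y$ holds in $G$ for some $x \in T^*$, $[A \to \cdot\,\beta, i] \in S_i$ is also valid for $\gamma \in \mathrm{VP}(G, i{:}w)$.

Case (ii): $\alpha \ne \varepsilon$. Therefore, $s_m$ was added to $S_i$ by either the Scanner or the Completer, i.e.,
$\alpha = \alpha' X$ for some $\alpha' \in V^*$ and $X \in V$. Let $s_{m-1} = [A \to \alpha' \cdot X\beta, j] \in S_{i'}$ for some $i'$, $j \le i' \le i$,
and let $p' = (s_0, s_1, \ldots, s_{m-1})$. By Lemma 5.1, $p'$ spells $\delta\alpha'$ for some $\delta \in V^*$ such that
$\delta\alpha' X = \delta\alpha = \gamma$. By Fact 5.2, $\alpha' \Rightarrow_r^{*} a_{j+1} a_{j+2} \cdots a_{i'}$ holds in $G$. Therefore, by the induction
hypothesis, $S' \Rightarrow_r^{*} \delta A y$ and $\delta \Rightarrow_r^{*} a_1 a_2 \cdots a_j$ hold in $G$ for some $y \in T^*$. That is,
$\delta\alpha' \in \mathrm{VP}(G, i'{:}w)$ and $[A \to \alpha' \cdot X\beta, j] \in S_{i'}$ is valid for $\delta\alpha'$. If $X \in T$, then $X = a_i$ and $i' = i - 1$.
If $X \in N$, then $X \Rightarrow_r^{*} a_{i'+1} a_{i'+2} \cdots a_i$ holds in $G$. Consequently, $\gamma \in \mathrm{VP}(G, i{:}w)$ and
$[A \to \alpha \cdot \beta, j] \in S_i$ is valid for $\gamma$. □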

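Corollary Let $p = (s_0, s_1, \ldots, s_m)$, $m \ge 0$, be a rooted path in $G_{E'}$ such that $p$ spells
$\gamma \in V^*$ and $s_m = [A \to \alpha \cdot \beta, j] \in \mathrm{basis}(S_i)$ for some $A \to \alpha\beta \in P$ and $i, j$, $0 \le j \le i \le n+1$.
Then $\gamma \in \mathrm{PVP}(G, i{:}w)$.

Proof. If $m = 0$, then $\gamma = \varepsilon$ and $i = 0$. By definition, $\varepsilon \in \mathrm{PVP}(G, 0{:}w)$. Otherwise, suppose
that $m > 0$. Since $s_m \in \mathrm{basis}(S_i)$, the last transition in $p$ is on $a_i \in T$, i.e., $i > 0$ and $\gamma = \gamma' a_i$
for some $\gamma' \in V^*$. By Lemma 5.2, $\gamma' \in \mathrm{VP}(G, (i{-}1){:}w)$ and $\gamma$ is a viable prefix of $G$. Therefore,
$\gamma \in \mathrm{PVP}(G, i{:}w)$. □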

The next lemma provides the converse to Lemma 5.2.

Lemma 5.3 Let $\gamma$ be a string in VP($G, i{:}w$) and let $[A \to \alpha \cdot \beta, j] \in S_i$ be a state which
is valid for $\gamma$ for some $A \to \alpha\beta \in P$ and $i, j$, $0 \le j \le i \le n+1$. Then there exists a rooted path
in $G_{E'}$ to $[A \to \alpha \cdot \beta, j] \in S_i$ which spells $\gamma$.

Proof. This lemma appears to be rather more difficult than Lemma 5.2 to prove rigorously.
In lieu of a formal proof, an intuitive argument is given. First the following observations are
made.

(1) Every state which is valid for $\gamma$ is in $S_i$. Otherwise, a contradiction of Fact 5.2
would result.

(2) If $\gamma \ne \varepsilon$, then there is some state $s \in S_i$ such that $s$ is valid for $\gamma$ and $s$ properly
cuts $\gamma$. In particular, Earley states that are added by the Scanner or Completer
properly cut the viable prefixes that they are valid for.

(3) If $\gamma \ne \varepsilon$, then for each state $s \in S_i$ which is valid for $\gamma$ there is a state $r \in S_i$ such
that (i) $r$ is also valid for $\gamma$, (ii) $r$ properly cuts $\gamma$, and (iii) there exists a path in
$G_{E'}$ from $r$ to $s$ which spells $\varepsilon$.

Given these observations, an informal inductive argument proceeds as follows where the
induction is on len($\gamma$).

Basis (len($\gamma$) $= 0$). For each state $s \in S_0$ which is valid for $\varepsilon \in \mathrm{VP}(G, 0{:}w)$, there exists a
rooted path in $G_{E'}$ to $s$ which spells $\varepsilon$.

Induction (len($\gamma$) $> 0$). Let $\gamma = \gamma' X$ for some $\gamma' \in V^*$ and $X \in V$. By points (2) and (3) above,
we may assume that $[A \to \alpha \cdot \beta, j] \in S_i$ properly cuts $\gamma$, i.e., $\alpha = \alpha' X$ for some $\alpha' \in V^*$. Let
$s = [A \to \alpha' X \cdot \beta, j] \in S_i$. For every $i'$, $j \le i' \le i$, for which $\alpha' \Rightarrow_r^{*} a_{j+1} a_{j+2} \cdots a_{i'}$ and
$X \Rightarrow_r^{*} a_{i'+1} a_{i'+2} \cdots a_i$ hold in $G$, $[A \to \alpha' \cdot X\beta, j]$ is in $S_{i'}$. Pick one such $i'$ (there must be at
least one) and let $r = [A \to \alpha' \cdot X\beta, j] \in S_{i'}$. By the induction hypothesis, $\gamma' \in \mathrm{VP}(G, i'{:}w)$, $r$ is
valid for $\gamma'$, and there exists a rooted path to $r$ in $G_{E'}$ which spells $\gamma'$. When $s$ is added to $S_i$
by either the Scanner or Completer, the transition $(r, X, s)$ is installed in $G_{E'}$. Therefore,
there exists a rooted path in $G_{E'}$ to $s$ which spells $\gamma$. □

Theorem 5.4 For $0 \le i \le n+1$, define $G_{E',i} = (\bigcup_{0 \le j \le i} S_j,\ V,\ \bigcup_{0 \le j \le i} E_j)$ and let
$M_{E',i} = (G_{E',i}, s_0, S_i)$ denote an NFA. Then $L(M_{E',i}) = \mathrm{VP}(G, i{:}w)$.

Proof. This theorem follows from Lemmas 5.2 and 5.3. □

Corollary For $0 \le i \le n+1$, define $G_{E',i,b} = (\bigcup_{0 \le j < i} S_j \cup \mathrm{basis}(S_i),\ V,\ \bigcup_{0 \le j < i} E_j \cup \mathrm{basis}(E_i))$
and let $M_{E',i,b} = (G_{E',i,b}, s_0, \mathrm{basis}(S_i))$ denote an NFA. Then
$L(M_{E',i,b}) = \mathrm{PVP}(G, i{:}w)$. □

Theorem 5.4 and its Corollary establish a direct relationship between Earley' and the
General_LR recognition scheme. Indeed, Earley' prescribes one possible approach to
realizing an implementation of General_LR. Note that the foregoing analysis of $G_{E'}$ provides a
constructive proof that for arbitrary $x \in T^*$, VP($G, x$) and PVP($G, x$) are regular languages.
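The relationship stated in Theorem 5.4 can be observed on a small example. The sketch below is an illustration under stated assumptions rather than the construction used later in the text: the toy grammar is hypothetical, each graph node is encoded as `(set_index, lhs, rhs, dot, back_pointer)` so that a state in $S_i$ is kept distinct from an identical item in another set, and `earley_graph` and `spelled` are illustrative names. Enumerating spelled strings explicitly terminates here only because the grammar has no epsilon-productions.

```python
# Hypothetical $-augmented toy grammar (not from the text):
#   S' -> S $,   S -> S + a | a
PRODS = {"S'": [("S", "$")], "S": [("S", "+", "a"), ("a",)]}
EPS = ""                                  # label of an epsilon-transition
ROOT = (0, "S'", ("S", "$"), 0, 0)        # the initial state s_0 in S_0

def earley_graph(tokens):
    """Build the state sets and the transitions (s, X, t) of the Earley state
    graph. A node is (set_index, lhs, rhs, dot, back_pointer)."""
    w = list(tokens) + ["$"]
    n1 = len(w)                            # n + 1
    sets, trans = [], set()
    S = {ROOT}
    for i in range(n1 + 1):
        changed = i < n1                   # S_{n+1} is not closed
        while changed:
            changed = False
            for s in list(S):
                _, lhs, rhs, dot, j = s
                new = set()
                if dot < len(rhs) and rhs[dot] in PRODS:        # rule (2)
                    new = {(s, EPS, (i, rhs[dot], r, 0, i))
                           for r in PRODS[rhs[dot]]}
                elif dot == len(rhs):                            # rule (3)
                    for q in list(S if j == i else sets[j]):
                        _, l, r, d, k = q
                        if d < len(r) and r[d] == lhs:
                            new.add((q, lhs, (i, l, r, d + 1, k)))
                if not new <= trans:
                    trans |= new
                    S |= {t for (_, _, t) in new}
                    changed = True
        sets.append(S)
        if i == n1:
            break
        # Scanner: transitions on a_{i+1} into basis(S_{i+1})
        scanned = {(s, w[i], (i + 1, s[1], s[2], s[3] + 1, s[4])) for s in S
                   if s[3] < len(s[2]) and s[2][s[3]] == w[i]}
        trans |= scanned
        S = {t for (_, _, t) in scanned}
        if not S:
            break
    return sets, trans

def spelled(sets, trans, i):
    """Strings spelled by rooted paths that end at a state in S_i."""
    seen, work = {(ROOT, ())}, [(ROOT, ())]
    while work:
        s, gamma = work.pop()
        for (p, X, t) in trans:
            if p == s:
                item = (t, gamma if X == EPS else gamma + (X,))
                if item not in seen:
                    seen.add(item)
                    work.append(item)
    return {gamma for (s, gamma) in seen if s in sets[i]}
```

For the input `a+a`, the strings spelled by rooted paths into $S_3$ are `S + a` and `S`, which are exactly the viable prefixes of this grammar that derive `a+a`.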

Earley’s Algorithm and Viable Suffixes

The last section considered strings in $V^*$ that are spelled by rooted paths in $G_{E'}$. The
string spelled by a path in $G_{E'}$ is determined directly from the grammar symbols that label
the transitions in that path. In this section, another string over $V$ is associated with a path
in $G_{E'}$, viz., a string that is derived from the states in that path. Specifically, the state
derivative of a path in $G_{E'}$ is defined recursively by the state_derivative function given in
Figure 5.3.


function state_derivative ($(s_0, s_1, \ldots, s_m)$)
    // $(s_0, s_1, \ldots, s_m)$, $m \ge 0$, is a path in $G_{E'}$.
    if $m = 0$ then    // Let $s_0 = [A \to \alpha \cdot \beta, j]$.
        return($\beta^R$)
    else if $s_0 = [A \to \alpha \cdot X\beta, j]$ and $s_1 = [A \to \alpha X \cdot \beta, j]$ then    // $(s_0, X, s_1) \in \delta_{E'}$
        return(state_derivative($(s_1, s_2, \ldots, s_m)$))
    else    // Let $s_0 = [A \to \alpha \cdot B\beta, j]$ and $s_1 = [B \to \cdot\,\omega, i]$.
        return($\beta^R$ state_derivative($(s_1, s_2, \ldots, s_m)$))
    fi
end


Figure 5.3 — The Definition of the State Derivative of a Path
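The recursion of Figure 5.3 can be sketched as follows. This is a hedged reading of the figure, which is partly illegible in the source: it assumes that the symbols after the dot are emitted in reverse order, an interpretation chosen because it agrees with the basis of Lemma 5.5, where the state derivative of the trivial path $(s_0)$ is the reversal of $S\$$. The tuple encoding `(lhs, rhs, dot, back_pointer)` and the example path are assumptions made for illustration.

```python
def state_derivative(path):
    """State derivative of a path of Earley states, each encoded as
    (lhs, rhs, dot, back_pointer); the result is a tuple of grammar symbols."""
    lhs, rhs, dot, back = path[0]
    if len(path) == 1:
        # m = 0: reverse of everything after the dot
        return tuple(reversed(rhs[dot:]))
    rest = state_derivative(path[1:])
    if path[1] == (lhs, rhs, dot + 1, back):
        # transition on the symbol after the dot (Scanner or Completer)
        return rest
    # epsilon-transition added by the Predictor: s_1 = [B -> . omega, i]
    assert path[1][2] == 0 and path[1][0] == rhs[dot]
    return tuple(reversed(rhs[dot + 1:])) + rest

# Hypothetical states for the toy grammar S' -> S $, S -> S + a | a
s0 = ("S'", ("S", "$"), 0, 0)
p1 = ("S", ("a",), 0, 0)
p2 = ("S", ("a",), 1, 0)
q1 = ("S", ("S", "+", "a"), 0, 0)
```

Under this reading, the path `[s0, q1]` has state derivative `$ a + S`, the reversal of `S + a $`, which is the form $(\beta\delta)^R$ used in Lemma 5.5 with $\beta = S + a$ and $\delta = \$$.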


Again, let $G_{E'} = (Q_{E'}, V, \delta_{E'})$ be the Earley state graph that results from applying
Earley' to $G$ and $w$. It transpires that the state derivative of an arbitrary rooted path in $G_{E'}$ is
a viable suffix of $G$. Moreover, the state derivative of a rooted path in $G_{E'}$ which terminates
at a state in $S_i$, $0 \le i \le n+1$, is a member of VS($G, i{:}w$). The analysis that follows adopts
the top-down interpretation given to Earley’s algorithm by Fact 5.3.

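Lemma 5.5 Let $p = (s_0, s_1, \ldots, s_m)$, $m \ge 0$, be a rooted path in $G_{E'}$ such that $\gamma \in V^*$
is the state derivative of $p$ and $s_m = [A \to \alpha \cdot \beta, j] \in S_i$ for some $A \to \alpha\beta \in P$ and $i, j$,
$0 \le j \le i \le n+1$. Then $\gamma \in \mathrm{VS}(G, i{:}w)$ and $[A \to \alpha \cdot \beta, j] \in S_i$ is valid for $\gamma$.

Proof. We show that $\gamma = (\beta\delta)^R$ for some $\delta \in V^*$ such that $S' \Rightarrow^{*} a_1 a_2 \cdots a_j A \delta$ holds in $G$.
The proof is by induction on $m$.

Basis ($m = 0$). Thus, $i = 0$, $s_m = s_0 = [S' \to \cdot S\$, 0] \in S_0$, and $\gamma = \$S$. By definition,
$\$S \in \mathrm{VS}(G, 0{:}w)$ and $s_0$ is clearly valid for $\$S$.

Induction ($m > 0$). Two cases are analyzed, based on whether or not $\alpha = \varepsilon$.

Case (i): $\alpha = \varepsilon$. In this case, $j = i$ and $s_m = [A \to \cdot\,\beta, i]$ was added to $S_i$ by the Predictor. Let
$s_{m-1} = [B \to \sigma \cdot A\tau, j'] \in S_i$ for some $B \to \sigma A \tau \in P$ and $j'$, $0 \le j' \le i$. By definition, the state
derivatives of $p' = (s_0, s_1, \ldots, s_{m-1})$ and $p$ are $(A\tau\delta)^R$ and $(\beta\tau\delta)^R = \gamma$, respectively, for some
$\delta \in V^*$. By Fact 5.3, $\sigma \Rightarrow^{*} a_{j'+1} a_{j'+2} \cdots a_i$ holds in $G$. By the induction hypothesis,
$S' \Rightarrow^{*} a_1 a_2 \cdots a_{j'} B \delta$ holds in $G$. That is, $(A\tau\delta)^R \in \mathrm{VS}(G, i{:}w)$ and $[B \to \sigma \cdot A\tau, j'] \in S_i$ is
valid for $(A\tau\delta)^R$. Clearly, $S' \Rightarrow^{*} a_1 a_2 \cdots a_i A \tau \delta$ also holds in $G$. Thus,
$(\beta\tau\delta)^R = \gamma \in \mathrm{VS}(G, i{:}w)$ and $[A \to \cdot\,\beta, i]$ is valid for $\gamma$.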

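Case (ii): $\alpha \ne \varepsilon$. Thus, $s_m$ was added to $S_i$ by either the Scanner or the Completer, i.e.,
$\alpha = \alpha' X$ for some $\alpha' \in V^*$ and $X \in V$. Let $s_{m-1} = [A \to \alpha' \cdot X\beta, j] \in S_{i'}$ for some $i'$, $j \le i' \le i$,
and let $p' = (s_0, s_1, \ldots, s_{m-1})$. The state derivatives of $p'$ and $p$ are $(X\beta\delta)^R$ and $(\beta\delta)^R = \gamma$,
respectively, for some $\delta \in V^*$. By Fact 5.3, $\alpha' \Rightarrow^{*} a_{j+1} a_{j+2} \cdots a_{i'}$ holds in $G$. By the
induction hypothesis, $S' \Rightarrow^{*} a_1 a_2 \cdots a_j A \delta$ holds in $G$, so $(X\beta\delta)^R \in \mathrm{VS}(G, i'{:}w)$ and
$[A \to \alpha' \cdot X\beta, j] \in S_{i'}$ is valid for $(X\beta\delta)^R$. If $X \in T$, then $X = a_i$ and $i' = i - 1$. If $X \in N$, then
$X \Rightarrow^{*} a_{i'+1} a_{i'+2} \cdots a_i$ holds in $G$. In either case, $\alpha \Rightarrow^{*} a_{j+1} a_{j+2} \cdots a_i$ holds in $G$.
Therefore, $(\beta\delta)^R = \gamma \in \mathrm{VS}(G, i{:}w)$ and $[A \to \alpha \cdot \beta, j] \in S_i$ is valid for $\gamma$. □

Corollary Let $p = (s_0, s_1, \ldots, s_m)$, $m \ge 0$, be a rooted path in $G_{E'}$ such that $\gamma \in V^*$ is
the state derivative of $p$ and $s_m = [A \to \alpha \cdot \beta, j] \in \mathrm{basis}(S_i)$ for some $A \to \alpha\beta \in P$ and $i, j$,
$0 \le j \le i \le n+1$. Then $\gamma \in \mathrm{PVS}(G, i{:}w)$.

Proof. If $m = 0$, then $i = 0$, and $s_m = s_0 = [S' \to \cdot S\$, 0] \in \mathrm{basis}(S_0)$. Thus, the state
derivative of $p$ is $\$S = \gamma$, which is in PVS($G, 0{:}w$) by definition. If $m > 0$, then $i > 0$,
$s_{m-1} = [A \to \alpha' \cdot a_i \beta, j] \in S_{i-1}$, and $s_m = [A \to \alpha' a_i \cdot \beta, j]$ for some $\alpha' \in V^*$, i.e., $\alpha = \alpha' a_i$. The
state derivatives of $p' = (s_0, s_1, \ldots, s_{m-1})$ and $p$ are $(a_i \beta\delta)^R$ and $(\beta\delta)^R = \gamma$, respectively, for
some $\delta \in V^*$. By Lemma 5.5, $(a_i \beta\delta)^R \in \mathrm{VS}(G, (i{-}1){:}w)$, so $\gamma \in \mathrm{PVS}(G, i{:}w)$. □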

The  next  lemma  provides  the  converse  to  Lemma  5.5. 

Lemma 5.6 Let γ be a string in VS(G, i:w) and let [A → α•β, j] ∈ S_i be a state which
is valid for γ for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there exists a rooted path
in G_E' to [A → α•β, j] with state derivative γ.

Proof.  A rigorous  proof  of  this  lemma  has  so  far  eluded  us.  Consequently,  a very  informal 
intuitive  argument  is  given  instead.  A more  convincing  proof  is  left  for  future  work. 

Observe that the basic result provided by Lemmas 5.2 and 5.3 is a graphical interpretation
of Fact 5.2 in terms of certain properties of G_E'. In turn, the goal of Lemmas 5.5 and
5.6 is a graphical interpretation of Fact 5.3 in terms of certain other properties of G_E'.

Consider VP(G, i:w) and VS(G, i:w) for some i, 0 ≤ i ≤ n+1. The previous section
established that γ ∈ V* is a member of VP(G, i:w) if and only if there is a rooted path in G_E'
to some state in S_i which spells γ. Lemma 5.5 showed that γ ∈ V* is a member of
VS(G, i:w) if there is a rooted path in G_E' to some state in S_i with state derivative γ. It
would be rather counterintuitive and at variance with Fact 5.3 if the converse to the previous
statement did not also hold. In fact, such a result would appear to subvert the generality
of Earley's algorithm. □

In contrast to the case with General_LR, Lemmas 5.5 and 5.6 establish a more covert
relationship between Earley' and General_LL. This is in keeping with the relative complexity
of the definitions of the spelling of a path and its state derivative.
Discussion

A graphical variant of Earley's algorithm was examined within the framework established
in the previous two chapters. In the process, some properties of Earley's algorithm
were identified and the efficacy of the General_LR and General_LL approaches to general
recognition was established. Earley's algorithm is an excellent vehicle for demonstrating the
effectiveness of General_LR and General_LL given that it is so well known and highly
regarded.

The analyses contained in the previous two sections illustrated how the sets of viable
prefixes (resp. viable suffixes) tracked by General_LR (resp. General_LL) are explicitly
represented in the state-transition graph that is constructed by Earley'. As Earley' is a
direct descendant of Earley, it is fair to conclude that these same sets are represented implicitly
in the Earley state sets that are constructed by Earley's original algorithm. By viewing
Earley's algorithm from this novel perspective, its operation and correctness have been
explained at a level of abstraction that is closer to that necessary for capturing the essence of
general canonical recognition.

The structure of G_E' exhibits how Earley' subsumes both the General_LR and
General_LL recognition schemes. Clearly, Earley' embodies General_LR considerably more
directly than General_LL. In light of this, it is perhaps more apt to view Earley's algorithm
as a general bottom-up recognizer.

Practical aspects of the General_LR recognition scheme are examined further in the
next chapter and Chapter VII extends it into a general parser. Thus, this chapter is transitional
in that it bridges the abstract treatment of general recognition presented in Chapters
III and IV with the concrete treatment of General_LR contained in Chapters VI and VII.
Attempts at deriving a general parser from General_LL were unsuccessful. Thus, an investigation
of the practical potential of General_LL is left for future work.


CHAPTER  VI 

A GENERAL  BOTTOM-UP  RECOGNIZER 


In this chapter, a general bottom-up recognizer that is directly based on the
General_LR recognition scheme is presented. In particular, the algorithm constructs a graph
in such a way that the regular sets of viable prefixes manipulated by General_LR are
represented in this graph. Aside from complications that can arise due to nullable nonterminals,
the recognizer is extended into a general parser rather seamlessly (parsing is the subject
of the next chapter). Thus, in light of the algorithm's practical potential, several implementation
issues are discussed. Throughout this chapter, an arbitrary reduced $-augmented
grammar G = (V, T, P, S) and an arbitrary string w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$} for
1 ≤ i ≤ n, a_{n+1} = $, are assumed.

Control  Automata  and  Recognition  Graphs 

The recognizer described in this chapter constructs a state-transition graph which we
call the recognition graph. The correctness of the algorithm is based on properties of this
graph. The recognition graph is constructed under the guidance of an FSA called the control
automaton. The control automaton is determined from the subject grammar G and is fixed
throughout the recognition process. In contrast, the recognition graph evolves during recognition;
its structure is derived from the control automaton and the input string w.

For simplicity, the LR(0) automaton of G is used as the control automaton for guiding
the recognition of w with respect to G; alternative control automata are suggested later.
The LR(0) automaton of G is a DFA which is based on the canonical collection of sets of
LR(0) items of G and the associated goto function [4,11]. Recall that each set is composed
of kernel and closure items. The item S' → •S$ is a kernel item, as are all items of the form
A → α•β such that α ≠ ε. With the exception of S' → •S$, all items of the form A → •β are
closure items.

We denote the LR(0) automaton of G by M_C(G) = (I, V, goto, I_0, I) where
I = {I_0, I_1, ..., I_{m-1}} is the collection of sets of LR(0) items. The "C" subscript is a
reminder that M_C(G) is used as the control automaton during recognition. For convenience,
we assume that S' → •S$ ∈ I_0 and S' → S$• ∈ I_{m-1}; in fact, the latter assumption implies that
I_{m-1} = {S' → S$•}. A detailed accounting of M_C(G) is not needed to describe how it is used
to recognize strings. However, the following well-known facts about M_C(G) are useful.

(1) L(M_C(G)) = VP(G).

(2) Each I_j ∈ I\{I_0} has a unique entry symbol X ∈ V, i.e., the grammar symbol that
all transitions to I_j are made on. The entry symbol for I_j, j ≠ 0, is denoted by
entry(I_j). There are no transitions directed to I_0 in M_C(G), so entry(I_0) is not
defined.

(3) For I_j ∈ I, (i) if A → α•Xβ ∈ I_j, then A → αX•β ∈ I_k where I_k = goto(I_j, X);

(ii) if A → αX•β ∈ I_j, then A → α•Xβ ∈ I_k for all I_k ∈ pred(I_j, X); and

(iii) if A → α•β ∈ I_j and A ≠ S', then goto(I_k, A) is defined for all I_k ∈ pred(I_j, α).

Automaton M_C(G) is also denoted by M_C if G is understood.
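Facts (2) and (3)(iii) can be checked mechanically on a small example. The following Python sketch is illustrative only: the grammar S' → E $, E → E + E, E → a, its hand-derived goto table, and the helper names entry_symbols and pred are assumptions introduced here, not part of the text.

```python
# Hand-derived goto table of the LR(0) automaton for the illustrative grammar
#   S' -> E $,   E -> E + E,   E -> a        (an assumed example grammar)
# State 0 is I_0; state 3 is {S' -> E $ .}; state 5 contains E -> E + E .
GOTO = {(0, "E"): 1, (0, "a"): 2, (1, "$"): 3, (1, "+"): 4,
        (4, "E"): 5, (4, "a"): 2, (5, "+"): 4}


def entry_symbols(goto):
    """Map each state to the set of symbols labelling its incoming transitions."""
    entry = {}
    for (_, sym), target in goto.items():
        entry.setdefault(target, set()).add(sym)
    return entry


def pred(goto, state, alpha):
    """States I_k from which a path spelling alpha leads to `state`."""
    cur = {state}
    for sym in reversed(alpha):
        # Step backwards over one symbol of alpha, right to left.
        cur = {k for (k, s), t in goto.items() if s == sym and t in cur}
    return cur


# Fact (2): every state other than I_0 has exactly one entry symbol.
ENTRY = entry_symbols(GOTO)
# Fact (3)(iii) for the item E -> E + E . in state 5.
BEFORE_HANDLE = pred(GOTO, 5, ["E", "+", "E"])
```

Here ENTRY has no key for state 0, and every state in BEFORE_HANDLE has a goto on E, which is what makes a reduction by E → E + E well defined in every left context.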

The  precise  manner  in  which  the  recognition  graph  is  constructed  is  the  essence  of  the 
algorithm  described  in  the  next  section.  Some  general  characteristics  of  recognition  graphs 
are  described  in  the  remainder  of  this  section. 

The recognition graph constructed under the guidance of M_C is denoted by
G_R(M_C) = (Q, V, δ). At the start of recognition, G_R(M_C) is set to an initial configuration.
Additional states and transitions are added to Q and δ, respectively, as the recognition
proceeds. The denotation G_R(M_C) is simplified to G_R whenever the intent is obvious.

Each state added to Q during recognition corresponds to a set of items I_j ∈ I of M_C,
0 ≤ j ≤ m−1, and a position i in w, 0 ≤ i ≤ n+1. For 1 ≤ i ≤ n+1, position i coincides with
a_i; the 0th position of w immediately precedes a_1. A subscript of j:i is used to denote the
state in Q that corresponds to I_j and input position i, e.g., q_{j:i}. The function ψ: Q → I is
defined to map a state in G_R to its associated set of items in M_C; thus, ψ(q_{j:i}) = I_j. For later
use, we define Q_i = {q_{j:i} ∈ Q}, 0 ≤ i ≤ n+1; that is, Q_i consists of all states in G_R that
correspond to input position i.

Similarly, each transition added to δ during recognition corresponds to a transition in
M_C. The members of δ are best described in terms of the mapping δ → goto induced by ψ,
defined as follows: for p, q ∈ Q and X ∈ V, (p, X, q) ∈ δ only if goto(ψ(q), X) = ψ(p). Thus,
each transition in G_R corresponds to the reversal of a transition in M_C. Consequently, all of
the transitions out of a state p ∈ Q are on entry(ψ(p)). Valid transitions in G_R are also constrained
by input position; specifically, (q_{k:i}, X, q_{j:h}) ∈ δ implies that h ≤ i. For 0 ≤ i ≤ n+1,
we define δ_i = {(q_{j:i}, X, p) ∈ δ}, i.e., δ_i consists of all transitions in G_R that emanate from
states in Q_i.
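The two constraints on transitions of G_R can be stated as a small predicate. This Python sketch is only an illustration under assumptions: states q_{j:i} are encoded as pairs (j, i), the goto table is the hand-derived one for the assumed grammar S' → E $, E → E + E, E → a, and the names psi and admissible are hypothetical.

```python
# Assumed hand-derived goto table for S' -> E $, E -> E + E, E -> a.
GOTO = {(0, "E"): 1, (0, "a"): 2, (1, "$"): 3, (1, "+"): 4,
        (4, "E"): 5, (4, "a"): 2, (5, "+"): 4}


def psi(q):
    """Index j of the item set I_j associated with the state q = (j, i)."""
    return q[0]


def admissible(p, x, q, goto):
    """True if (p, x, q) may appear in delta.

    The transition must reverse a transition of M_C, that is
    goto(psi(q), x) = psi(p), and the input position of q must not
    exceed the input position of p.
    """
    return goto.get((psi(q), x)) == psi(p) and q[1] <= p[1]
```

For instance, the pair of states (2, 1) and (0, 0) may be linked by a transition on a, because goto(I_0, a) = I_2 and position 0 precedes position 1, whereas the same transition written in the other direction is not admissible.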


The General_LR0 Recognizer

The general context-free recognizer, informally named General_LR0, is described in
this section. Concurrently, intuitive arguments for its correctness are presented. Establishing
the correctness of General_LR0 reduces to demonstrating that it is a faithful realization
of the General_LR recognition scheme, i.e., that the sets of viable prefix associates that
General_LR tracks are correctly represented in the graph constructed by General_LR0 as w
is scanned from left to right.

General_LR0 is described in terms of how it operates when it is applied to G and w.
Under the guidance of M_C(G), the LR(0) automaton of G, General_LR0 constructs a recognition
graph G_R(M_C). Some general notions about recognition graphs were introduced in the
last section. The description of General_LR0 that follows provides more specific details
about how G_R is derived from M_C and w. For reference, General_LR0 is rendered in pseudocode
in Figure 6.1.
 1.  function General_LR0(G = (V, T, P, S); w ∈ T*)
 2.      // w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $
 3.      // Let M_C(G) = (I, V, goto, I_0, I) be the LR(0) automaton for G.
 4.      // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
 5.      Q, δ := {q_{0:0}}, ∅                      // Initialize G_R.
 6.      // Let M_R = (G_R^{-1}, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
 7.      for i := 0 to n do
 8.          // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
 9.          Reduce(i)
10.          // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.          Shift(i)
12.          // Let M_R = (G_R^{-1}, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.          if Q_{i+1} = ∅ then Reject(w) fi
14.      od
15.      // Let M_R = (G_R^{-1}, q_{0:0}, Q_{n+1}).
16.      Accept(w)
17.  end

18.  function Shift(i)
19.      Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.      while Q_subset ≠ ∅ do
21.          q := Remove(Q_subset)                 // Let goto(ψ(q), a_{i+1}) = I_j.
22.          if q_{j:i+1} ∉ Q then
23.              Q := Q ∪ {q_{j:i+1}}
24.          fi
25.          δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)}     // Never redundant.
26.      od
27.  end

Figure 6.1 - The General_LR0 Recognizer



Throughout its evolution, the structure of G_R is paramount. Certain intermediate
stages in its construction hold particular interest. At each of these points, an FSA may be
defined in terms of G_R^{-1} which accepts one of the sets of viable prefix associates that is computed
by the General_LR recognition scheme. The FSA derived from G_R^{-1} is denoted by
M_R. The inverse of G_R is desired since each of its transitions is reversed from the orientation
of the corresponding transition in M_C.

It is important to remember that G_R evolves continuously throughout the recognition
process. Consequently, "G_R" and "M_R" denote a different graph and automaton,
28.  function Reduce(i)
29.      δ_subset := δ_i
30.      Traverse(Q_i, i)
31.      while δ_subset ≠ ∅ do
32.          (p, X, q) := Remove(δ_subset)
33.          for A → αX•β ∈ ψ(p) such that β ⇒* ε do
34.              for r ∈ succ(q, α^R) do           // Let goto(ψ(r), A) = I_j.
35.                  if q_{j:i} ∉ Q then
36.                      Q := Q ∪ {q_{j:i}}
37.                      Traverse({q_{j:i}}, i)
38.                  fi
39.                  if (q_{j:i}, A, r) ∉ δ then
40.                      δ := δ ∪ {(q_{j:i}, A, r)}
41.                      δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
42.                  fi
43.              od
44.          od
45.      od
46.  end

47.  function Traverse(Q_subset, i)
48.      while Q_subset ≠ ∅ do
49.          q := Remove(Q_subset)
50.          for goto(ψ(q), A) = I_j such that A ⇒* ε do     // A ∈ N
51.              if q_{j:i} ∉ Q then
52.                  Q := Q ∪ {q_{j:i}}
53.                  Q_subset := Q_subset ∪ {q_{j:i}}
54.              fi
55.              δ := δ ∪ {(q_{j:i}, A, q)}         // Never redundant.
56.          od
57.      od
58.  end

Figure 6.1 - continued



respectively, at distinct stages of recognition. The makeup of G_R at any given time determines
which regular set is recognized by M_R. The General_LR0 recognizer is best understood
through an appreciation of how it transforms G_R.

The General_LR0 recognizer is composed of a main function (lines 1-17 in Figure 6.1)
and three auxiliary functions, Shift, Reduce, and Traverse. The Shift function (lines 18-27)
computes the ⊢ relation whereas Reduce (lines 28-46) computes the closure ⊨* of the ⊨ relation. The
Traverse function (lines 47-58) is called from within Reduce. It handles certain transitions on
nullable nonterminal symbols. A line-by-line description of the General_LR0 recognizer follows.

(Line 1) General_LR0 is supplied with two arguments, a reduced $-augmented grammar
G and a string w over the terminal alphabet of G.

(Lines 2-4) By assumption, w is terminated with $. For simplicity, we also assume that
the LR(0) automaton of G, M_C(G), is provided by some external agent.1 Each of w, M_C,
and G_R is visible to the functions that require access to it.

(5-6) Graph G_R is initialized to contain the single state q_{0:0}. The comment in line 6
indicates that G_R^{-1} can be trivially embedded into an FSA that accepts PVP(G, ε) = {ε} at
this point. Henceforth, the following statement holds for G_R throughout the duration of
recognition. For q_{j:i} ∈ Q where 0 ≤ j ≤ m−1 and 0 ≤ i ≤ n+1, each path in G_R from q_{j:i} to
q_{0:0} (1) spells the reversal of a string in VP(G, i:w), and (2) corresponds to the reversal of a
path from I_0 to I_j in M_C. As seen below, even stronger statements may be made about G_R
at particular points during recognition.

(7) This for loop iterates once for each terminal symbol in w. Having i range from 0
to n rather than from 1 to n+1 yields a cleaner expression of the algorithm. The rest of
the discussion primarily elaborates on the ith iteration of this for loop for some i, 0 ≤ i ≤ n.

(8-10) The comment in line 8 is both a loop invariant and a precondition of the Reduce
function. It clearly holds upon entry to the loop; the Reduce and Shift functions ensure that
it also holds at the start of each iteration. This condition can be alternately stated as follows.
A string γ ∈ V* is a member of PVP(G, i:w) if and only if there is a path in G_R from
some state q ∈ Q_i to q_{0:0} which spells γ^R. The comment in line 10 is a postcondition of the
Reduce function and may be restated similarly; that is, a string γ ∈ V* is a member of
VP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_{0:0} which spells
γ^R. Assuming that the precondition holds when Reduce is called, the Reduce function
transforms G_R so that the postcondition holds.

1 An alternative is for General_LR0 to construct M_C as an initial task.
(10-12) The postcondition of Reduce in line 10 is also a precondition of the Shift function.
A postcondition of the Shift function is given in line 12 and is similar to the loop invariant.
However, in this case the following situation holds for G_R. A string γa_{i+1} ∈ V* is a
member of PVP(G, i+1:w) if and only if there is a path in G_R from some state q ∈ Q_{i+1} to
q_{0:0} which spells a_{i+1}γ^R. Assuming that the precondition holds when Shift is called, the Shift
function transforms G_R so that this postcondition holds.

(12-13) If Q_{i+1} = ∅ at this point, then M_R has no final states. Thus, PVP(G, i+1:w) =
∅ and i+1:w ∉ PREFIX(G). Consequently, w ∉ L(G), so General_LR0 rejects w.

(15-16) Line 15 expresses a postcondition of the for loop. It holds upon completion of
the nth iteration (i.e., when i = n) provided that the postcondition of Shift and Q_{i+1} ≠ ∅ both
hold at the end of that iteration. In this case, w ∈ L(G), so General_LR0 accepts w.

Before continuing with the description of General_LR0, the following important properties
of LR(0) automata are reiterated. Let A → ω• ∈ I_j hold for some A → ω ∈ P with A ≠ S'
and I_j ∈ I. In addition, let δω be the spelling of an arbitrary path in M_C from I_0 to I_j for
some δ ∈ V*. Then δω ⊨ δA holds in G. Now let A → α•aβ ∈ I_j hold for some A → αaβ ∈ P
and I_j ∈ I, and let δα be the spelling of an arbitrary path in M_C from I_0 to I_j. In this case
δα ⊢_a δαa holds in G. Based on the manner in which G_R is derived from M_C, these two
equivalence properties (i.e., the equivalence of paths from I_0 to I_j with respect to reduce and
shift actions) are preserved in G_R (i.e., all paths in G_R^{-1} from q_{0:0} to q_{j:i} are equivalent with
respect to shift and reduce actions). These equivalence properties are exploited by the Shift
and Reduce functions.

(11,18) The Shift function is called with i as an argument. This makes the relationship
between the values of i in General_LR0 and Shift explicit. The operation of the Shift function
during its ith invocation from General_LR0 is described for some i, 0 ≤ i ≤ n.

(19) At this point, we know that Q_i cannot be empty. Otherwise, the input string
would have been rejected in an earlier iteration of the main for loop. The ith call to Shift
computes the ⊢_{a_{i+1}} relation.2 Thus, we want to determine all states q ∈ Q_i for which there is
a transition on a_{i+1} from ψ(q) in M_C. The set variable called Q_subset is initialized to contain
these states.

2 It is important to remember that i ranges from 0 to n.

(20) Each state in Q_subset is considered in turn. No additional states are added to
Q_subset within the while loop.

(21-25) A state q is removed from Q_subset. Since (ψ(q), a_{i+1}, I_j) is a transition in M_C,
we need to add q_{j:i+1} to Q and (q_{j:i+1}, a_{i+1}, q) to δ. It is possible that there is more than one
transition on a_{i+1} to I_j in M_C, so q_{j:i+1} may have been added to Q in an earlier iteration of
the while loop. This condition is checked in line 22 and q_{j:i+1} is added to Q only if it is
necessary. However, the transition (q_{j:i+1}, a_{i+1}, q) cannot already be in δ since there is only
one transition on a_{i+1} from ψ(q) in M_C. This transition is added to δ in line 25.

(27) By assumption, the precondition in line 10 holds when Shift is called. Based on the
manner in which certain paths in G_R are extended by the Shift function under the guidance
of M_C, the postcondition of Shift holds at this point.

The  transformations  of  GR  made  by  Reduce  are  considerably  more  elaborate.  This  is 
not  unexpected  since  Reduce  computes  the  reflexive-transitive  closure  of  a relation. 

The operation of the Reduce function during its ith invocation from General_LR0 is
described for some i, 0 ≤ i ≤ n. During this invocation, Reduce adds states to Q_i and installs
transitions from states in Q_i to states in Q_j where 0 ≤ j ≤ i. The transitions from states in
Q_i to states in Q_j where 0 ≤ j < i are handled directly by Reduce. On the other hand, the
transitions among states in Q_i warrant special treatment. They are problematic in the general
case as they can introduce cycles into the recognition graph. These transitions, always
made on nullable nonterminals, are handled separately by the Traverse function.

(9,28) Like Shift, the Reduce function is supplied with i as an argument so that the
relationship between the values of i in General_LR0 and Reduce is explicit.

(29) At this point, each transition in δ_i may come from a state that calls for one or
more reductions. If i = 0, then there are no applicable transitions. If i > 0, the relevant
transitions were installed in G_R by Shift during the previous iteration of the main for loop of
General_LR0. In any case, a set variable called δ_subset is initialized to contain δ_i; it is crucial
that this assignment occur before Traverse is called.

(30)  In  short,  Traverse  creates  certain  paths  to  states  in  Qi  that  spell  strings  of  nullable 
nonterminal  symbols.  Further  discussion  of  the  Traverse  function  is  deferred  until  later. 
The  Reduce  function  can  be  understood  independently  of  it. 

(31) Each transition in δ_subset is considered in turn. All reductions relevant to the
source states of those transitions are performed. Additional transitions may be added to
δ_subset within this loop.

(32) A transition (p, X, q) is removed from δ_subset.

(33) The set of items ψ(p) determines what reductions, if any, are applicable to p. Any
kernel item of the form A → αX•β ∈ ψ(p) such that β ⇒* ε holds in G is relevant; that is, we
see through certain nullable suffixes of production right-hand sides. In effect, a reduction
from p on A → αX is performed. As described below, a path to p spelling β^R will have been
installed in G_R by an earlier call to the Traverse function. In this way, any cycles created in
Q_i by nullable nonterminals are left for Traverse to handle.

(34) At this point we are considering one particular reduction applicable to p, say
A → αX•β ∈ ψ(p) where β is nullable. This reduction is performed by traversing certain
paths in G_R from p that spell Xα^R to locate the states in Q to which transitions on A
must be made. In particular, we want to traverse only those paths that start with the transition
(p, X, q). Any other transition from p will have either already been reduced through
or else is in δ_subset waiting to be handled in a later iteration of the while loop. The states
of interest are given by succ(q, α^R). It is precisely this application of succ that motivates
reversing the transitions in G_R with respect to those in M_C.

(35-42) At this point we are dealing with one particular state r ∈ succ(q, α^R) and we
assume that goto(ψ(r), A) = I_j for some I_j ∈ I. Thus, we need a state q_{j:i} in Q_i and a transition
(q_{j:i}, A, r) in δ_i. Both of these objects may already exist in G_R, so they are conditionally
created as indicated by the if statements. Incidentally, a transition is generated redundantly
here as the result of an ambiguity. If the transition is indeed new, it is added to
δ_subset; any relevant reductions from q_{j:i} are performed through this transition when it is
removed from δ_subset in a later iteration of the while loop.

(46) The postcondition of Reduce holds at this point. To help establish this fact, a subset
of VP(G, i:w), denoted by VP'(G, i:w), is defined as follows: (1) for i = 0, VP'(G, 0:w) =
PVP(G, 0:w); (2) for 0 < i ≤ n, VP'(G, i:w) = {αX ∈ VP(G) | α ∈ VP(G, j:w), 0 ≤ j < i,
X ⇒* a_{j+1} a_{j+2} ... a_i holds in G}. For 0 ≤ i ≤ n, PVP(G, i:w) ⊆ VP'(G, i:w) ⊆ VP(G, i:w)
clearly holds. The states and transitions added to G_R directly by Reduce ensure that
VP'(G, i:w) ⊆ L(M_R) holds. The contribution that Traverse makes to the transformation of
G_R can be assessed by noting that VP(G, i:w) = {αβ ∈ VP(G) | α ∈ VP'(G, i:w), β ⇒* ε holds
in G}. The Traverse function creates any additional states in Q_i and transitions among
those states so that VP(G, i:w) \ VP'(G, i:w) ⊆ L(M_R) also holds. Together, the Reduce and
Traverse functions guarantee that L(M_R) = VP(G, i:w).

(30,37,47) Traverse deals solely with nullable nonterminals and productions with nullable
right-hand sides. In lines 30 and 37, Traverse is called with a nonempty subset of Q_i as
an argument which becomes associated with the set variable called Q_subset. Traverse has
the effect of transforming G_R as if all sequences of reductions by productions that have nullable
right-hand sides are carried out from the states in Q_subset. However, a transformation
of G_R that produces the same result can be derived from a simple traversal of M_C. By
adopting this alternative approach, complications that can arise due to cycles in G_R are
avoided. Consider the states I_k ∈ I such that ψ(q) = I_k for some q ∈ Q_subset and traverse
M_C beginning from these states along all transitions that are made on nullable nonterminals.
The states and transitions encountered in this traversal are exactly those which would arise
from performing the reduction sequences described above. Consequently, counterparts for all
of the states and transitions encountered in this traversal are created in G_R. Thus, a particular
subgraph of M_C is effectively embedded in Q_i by this process. The specific subgraph is
determined by the composition of Q_subset when Traverse is called.
(48) Each state in Q_subset is considered in turn. Additional states may be added to
Q_subset within the loop.

(49) A state q is removed from Q_subset.

(50) All transitions from ψ(q) in M_C that are made on some nullable nonterminal A are
relevant. Let goto(ψ(q), A) = I_j be one such transition.

(51-55) We need a state q_{j:i} in Q_i and a transition (q_{j:i}, A, q) in δ_i. This state may
already exist in Q_i, so it is conditionally created. If q_{j:i} is indeed new, it is added to
Q_subset; the traversal will resume from q_{j:i} when it is removed from Q_subset in a later
iteration of the while loop. However, the transition (q_{j:i}, A, q) is never generated redundantly;
the discipline imposed by the graph traversal ensures that the transitions from each
state encountered are considered at most once.

If the two calls to Traverse are removed from the Reduce function and the line "9a.
Traverse(Q_i, i)" is added to General_LR0 following line 9, an equivalent transformation of
G_R results, i.e., one that satisfies the condition stated in line 10. In this way, Traverse
becomes a postprocessor of Reduce. However, for the purposes of parsing it is more
appropriate to call Traverse from within Reduce as we have done in Figure 6.1. This will
become evident in the next chapter when General_LR0 is extended into a general parser.

That General_LR0 correctly implements the General_LR recognition scheme may be
established by induction on i. This induction depends, in turn, on proving that the Reduce
(resp. Shift) function correctly transforms G_R such that the postcondition in line 10 (resp.
line 12) holds if the precondition in line 8 (resp. line 10) holds before the function is called.
Although the Shift and Reduce functions are not formally proven correct, it is expected that
the above detailed explanation of General_LR0 provides sufficient intuitive evidence toward
that end.
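As an informal companion to the line-by-line description, the following Python sketch transcribes Figure 6.1 into runnable form. It is a sketch under assumptions, not the implementation used in this work: the grammar encoding (a list of (lhs, rhs) pairs whose first entry is the augmented production), the pair encoding (j, i) for q_{j:i}, the helper names, and the two sample grammars are all introduced here for illustration. The LR(0) automaton is built by the standard closure/goto construction rather than supplied externally.

```python
from collections import defaultdict


def nullable_set(prods):
    """Nonterminals A with A =>* epsilon (fixed-point iteration)."""
    null, changed = set(), True
    while changed:
        changed = False
        for lhs, rhs in prods:
            if lhs not in null and all(s in null for s in rhs):
                null.add(lhs)
                changed = True
    return null


def closure(items, prods):
    """LR(0) closure of a set of items (lhs, rhs, dot)."""
    items, work = set(items), list(items)
    while work:
        _, rhs, dot = work.pop()
        for lhs2, rhs2 in prods:
            if dot < len(rhs) and lhs2 == rhs[dot] and (lhs2, rhs2, 0) not in items:
                items.add((lhs2, rhs2, 0))
                work.append((lhs2, rhs2, 0))
    return frozenset(items)


def lr0_automaton(prods):
    """Item sets and goto table of M_C; prods[0] is the augmented S' -> S $."""
    i0 = closure({(prods[0][0], prods[0][1], 0)}, prods)
    sets, index, goto, work = [i0], {i0: 0}, {}, [i0]
    while work:
        cur = work.pop()
        for sym in {r[d] for _, r, d in cur if d < len(r)}:
            kernel = {(l, r, d + 1) for l, r, d in cur if d < len(r) and r[d] == sym}
            nxt = closure(kernel, prods)
            if nxt not in index:
                index[nxt] = len(sets)
                sets.append(nxt)
                work.append(nxt)
            goto[(index[cur], sym)] = index[nxt]
    return sets, goto


def general_lr0(prods, w):
    """Recognize w (a list of terminals ending in '$'), following Figure 6.1."""
    sets, goto = lr0_automaton(prods)
    null, start = nullable_set(prods), prods[0][0]
    Q = {(0, 0)}                  # states q_{j:i} encoded as pairs (j, i)
    out = defaultdict(set)        # out[p] = {(X, q)} for each (p, X, q) in delta

    def succ(q, symbols):
        """States reached from q in G_R along a path spelling `symbols`."""
        cur = {q}
        for s in symbols:
            cur = {t for p in cur for (x, t) in out[p] if x == s}
        return cur

    def traverse(q_subset, i):    # lines 47-58
        work = list(q_subset)
        while work:
            q = work.pop()
            for a in null:
                j = goto.get((q[0], a))
                if j is None:
                    continue
                if (j, i) not in Q:
                    Q.add((j, i))
                    work.append((j, i))
                out[(j, i)].add((a, q))

    def reduce(i):                # lines 28-46
        q_i = [q for q in Q if q[1] == i]
        d_subset = [(p, x, q) for p in q_i for (x, q) in out[p]]
        traverse(q_i, i)
        while d_subset:
            p, x, q = d_subset.pop()
            for lhs, rhs, dot in sets[p[0]]:
                # Keep kernel items A -> alpha X . beta with beta nullable, A != S'.
                if dot == 0 or lhs == start or not all(s in null for s in rhs[dot:]):
                    continue
                for r in succ(q, reversed(rhs[:dot - 1])):
                    new = (goto[(r[0], lhs)], i)
                    if new not in Q:
                        Q.add(new)
                        traverse([new], i)
                    if (lhs, r) not in out[new]:
                        out[new].add((lhs, r))
                        d_subset.append((new, lhs, r))

    def shift(i):                 # lines 18-27; w[i] plays the role of a_{i+1}
        for q in [q for q in Q if q[1] == i]:
            j = goto.get((q[0], w[i]))
            if j is not None:
                Q.add((j, i + 1))
                out[(j, i + 1)].add((w[i], q))

    for i in range(len(w)):       # i = 0..n
        reduce(i)
        shift(i)
        if not any(q[1] == i + 1 for q in Q):
            return False          # Q_{i+1} is empty: reject
    return True


# Two assumed sample grammars: a^n b^n (with a nullable S) and an ambiguous one.
ANBN = [("S'", ("S", "$")), ("S", ("a", "S", "b")), ("S", ())]
AMBIG = [("S'", ("E", "$")), ("E", ("E", "+", "E")), ("E", ("a",))]
```

A call such as general_lr0(ANBN, list("aabb$")) exercises Traverse, because S is nullable, while general_lr0(AMBIG, list("a+a+a$")) exercises the redundancy test of line 39, because the two parses of the input propose the same transition on E twice.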
Earley's Algorithm Revisited

A general recognizer that operates in a manner strikingly similar to Earley' is obtained by modifying
General_LR0 to use a particular nondeterministic variant of the LR(0) automaton for G as a
control automaton. The alternate control automaton, the modified algorithm, and its relationship
to Earley' are briefly discussed in this section.

Alternate  Control  Automata 

The nondeterministic LR(0) (or NLR(0)) automaton of G [24, p. 250] is denoted here by
M_NC(G) = (I, V, goto, I_0, I) where

(1) I = {I_0, I_1, ..., I_{m-1}} = {{A → α•β} | A → αβ ∈ P},

(2) goto({A → α•Xβ}, X) = {A → αX•β}, and

(3) {B → •ω} ∈ goto({A → α•Bβ}, ε) for each B → ω ∈ P.

In this case, we prescribe that I_0 = {S' → •S$} and I_{m-1} = {S' → S$•}. Again, M_NC(G) is
simplified to M_NC when G is understood. If the standard subset construction algorithm for
converting NFAs to DFAs is applied to M_NC, the (deterministic) LR(0) automaton of G is
obtained, i.e., M_C(G).
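The relationship between the two automata can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the item encoding (lhs, rhs, dot), the helper names, and the sample grammar S' → E $, E → E + E, E → a. It builds the NLR(0) automaton with one state per item and then applies the usual subset construction with ε-closure.

```python
def nlr0(prods):
    """NLR(0) automaton with one state per item (lhs, rhs, dot).

    Returns (move, eps): move maps (item, X) to the item with the dot
    advanced over X; eps maps an item A -> alpha . B beta to the items
    B -> . omega reachable by an epsilon-transition.
    """
    items = [(l, r, d) for l, r in prods for d in range(len(r) + 1)]
    move = {(it, it[1][it[2]]): (it[0], it[1], it[2] + 1)
            for it in items if it[2] < len(it[1])}
    eps = {it: [(l, r, 0) for l, r in prods
                if it[2] < len(it[1]) and l == it[1][it[2]]]
           for it in items}
    return move, eps


def eps_closure(state_set, eps):
    """All NLR(0) states reachable from state_set by epsilon-transitions."""
    seen, work = set(state_set), list(state_set)
    while work:
        for nxt in eps[work.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                work.append(nxt)
    return frozenset(seen)


def subset_construction(prods):
    """Determinize the NLR(0) automaton; prods[0] is the augmented production."""
    move, eps = nlr0(prods)
    start = eps_closure({(prods[0][0], prods[0][1], 0)}, eps)
    dfa, trans, work = {start}, {}, [start]
    while work:
        cur = work.pop()
        for sym in {it[1][it[2]] for it in cur if it[2] < len(it[1])}:
            nxt = eps_closure({move[(it, sym)] for it in cur if (it, sym) in move}, eps)
            trans[(cur, sym)] = nxt
            if nxt not in dfa:
                dfa.add(nxt)
                work.append(nxt)
    return start, dfa, trans


# Assumed sample grammar.
AMBIG = [("S'", ("E", "$")), ("E", ("E", "+", "E")), ("E", ("a",))]
START, DFA, TRANS = subset_construction(AMBIG)
```

For this grammar each DFA state is a set of items that coincides with one of the canonical LR(0) item sets, since ε-closure over NLR(0) states performs the same work as the closure operation on items.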

Some functions related to succ and pred are needed for navigating through NLR(0)
automata and the recognition graphs derived from them. Toward that end, let
$G_0 = (Q, \Sigma, \delta)$ be an STG. The $\Sigma$-succ and $\Sigma$-pred functions, both of type
$Q \times \Sigma^* \to 2^Q$, are defined recursively as follows.

(1) For $q \in Q$, $\Sigma\text{-succ}(q, \varepsilon) = \Sigma\text{-pred}(q, \varepsilon) = \{q\}$;

(2) for $p \in Q$, $a \in \Sigma$, and $x \in \Sigma^*$,

$\Sigma\text{-succ}(p, xa) = \{r \in Q \mid q \in \Sigma\text{-succ}(p, x),\ (q, a, r) \in \delta\}$ and
$\Sigma\text{-pred}(p, ax) = \{r \in Q \mid q \in \Sigma\text{-pred}(p, x),\ (r, a, q) \in \delta\}$.

Thus, $\Sigma$-succ and $\Sigma$-pred effectively ignore $\varepsilon$-transitions. Note that if $G_0$ is
$\varepsilon$-free, then $\Sigma$-succ (resp. $\Sigma$-pred) is identical to succ (resp. pred). The $\varepsilon$-succ and
$\varepsilon$-pred functions, both of type $Q \to 2^Q$, are defined for dealing with $\varepsilon$-transitions. For
$p \in Q$, $\varepsilon\text{-succ}(p) = \{q \in Q \mid (p, \varepsilon, q) \in \delta\}$ and
$\varepsilon\text{-pred}(p) = \{q \in Q \mid (q, \varepsilon, p) \in \delta\}$. All four of these functions extend
to subsets of Q in the usual fashion.
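
The four definitions translate almost directly into code. The following Python sketch
assumes that $\delta$ is given as a collection of (source, label, target) triples and that the
empty string labels $\varepsilon$-transitions; the function names are hypothetical.

```python
EPS = ""  # label used for epsilon-transitions (assumption of this sketch)

def sigma_succ(delta, p, x):
    """States reached from p along paths spelling x, never following an epsilon-transition."""
    states = {p}
    for a in x:                       # extend the path one symbol at a time
        states = {r for (q, b, r) in delta if q in states and b == a}
    return states

def sigma_pred(delta, p, x):
    """States from which a path spelling x (without epsilon-transitions) leads to p."""
    states = {p}
    for a in reversed(x):             # walk backwards from p, last symbol first
        states = {r for (r, b, q) in delta if q in states and b == a}
    return states

def eps_succ(delta, p):
    """States reached from p by a single epsilon-transition."""
    return {q for (s, b, q) in delta if s == p and b == EPS}

def eps_pred(delta, p):
    """States that reach p by a single epsilon-transition."""
    return {q for (q, b, s) in delta if s == p and b == EPS}

# Small example: 0 -a-> 1 -b-> 2, and 1 -eps-> 3 -b-> 4
delta = {(0, "a", 1), (1, "b", 2), (1, EPS, 3), (3, "b", 4)}
```

In the example, sigma_succ(delta, 0, "ab") contains state 2 but not state 4, because the
path to 4 passes through an $\varepsilon$-transition.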

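The following facts apply to the NLR(0) automaton $M_{NC}(G)$.

(1) $L(M_{NC}(G)) = VP(G)$.

(2) Each $I_j \in I \setminus \{I_0\}$ has a unique entry symbol $X \in V \cup \{\varepsilon\}$, again denoted by
entry($I_j$).

(3) For $\{A \to \alpha \bullet \beta\} \in I$ such that $A \neq S'$, $V\text{-pred}(\{A \to \alpha \bullet \beta\}, \alpha) = \{A \to \bullet\,\alpha\beta\}$ and
goto($I_j$, A) is defined for each $I_j \in \varepsilon\text{-pred}(\{A \to \bullet\,\alpha\beta\})$.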

An  Alternate  Recognizer 

The General_LR0 recognizer is modified to employ the NLR(0) automaton of G as a
control automaton in place of the LR(0) automaton. The resulting algorithm, called
General_NLR0, is displayed in Figure 6.2. Only a small number of minor changes were
required to derive General_NLR0 from General_LR0. The differences between the two
recognizers are discussed next.

The lines in Figure 6.2 were numbered so as to emphasize the correlation between the
General_LR0 and General_NLR0 recognizers. Consequently, the line numbers cited below
reference code in both Figures 6.1 and 6.2.

(3-4) It is explicitly recorded that the NLR(0) automaton of G, $M_{NC}(G)$, is used as the
control automaton in General_NLR0. Thus, the recognition graph constructed by
General_NLR0, $G_R(M_{NC})$, is derived from $M_{NC}$ and the input string w.

(23) A state $I_j$ of $M_{NC}$ has more than one in-coming transition only if entry($I_j$) = $\varepsilon$.
Therefore, $q_{j:i+1}$ is unconditionally added to Q at this point, i.e., lines 22 and 24 are not
needed in Figure 6.2.

(33) Each set of items in $M_{NC}$ is a singleton, so at most one reduction can apply to
$\psi(p)$. Thus, an if construct is more appropriate here in place of the for loop of Figure 6.1.

 1.  function General_NLR0(G = (V, T, P, S); w ∈ T*)
 2.    // w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T \ {$}, 1 ≤ i ≤ n, a_{n+1} = $
 3.    // Let M_NC(G) = (I, V, goto, I_0, I) be the NLR(0) automaton for G.
 4.    // G_R(M_NC) = (Q, V, δ) is an STG, the recognition graph.
 5.    Q, δ := {q_{0:0}}, ∅    // Initialize
 6.    // Let M_R = (G_R^{-1}, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
 7.    for i := 0 to n do
 8.      // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
 9.      Reduce(i)
10.      // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.      Shift(i)
12.      // Let M_R = (G_R^{-1}, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.      if Q_{i+1} = ∅ then Reject(w) fi
14.    od
15.    // Let M_R = (G_R^{-1}, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.    Accept(w)
17.  end

18.  function Shift(i)
19.    Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.    while Q_subset ≠ ∅ do
21.      q := Remove(Q_subset)    // Let goto(ψ(q), a_{i+1}) = I_j.
23.      Q := Q ∪ {q_{j:i+1}}    // Never redundant.
25.      δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)}    // Never redundant.
26.    od
27.  end

Figure 6.2 - The General_NLR0 Recognizer


(34) The appropriate successors of p along paths in $G_R$ that spell $X\alpha^R$ are located
using the V-succ and $\varepsilon$-succ functions (instead of the succ function). This is necessitated by
the presence of $\varepsilon$-transitions in $G_R$.

(50) As in General_LR0, the Traverse function of General_NLR0 effectively performs
a certain traversal of $M_{NC}$. However, in this case we also want to step over
$\varepsilon$-transitions. Traversing $\varepsilon$-transitions in this way mirrors the Earley Predictor function.




28.  function Reduce(i)
29.    δ_subset := δ_i
30.    Traverse(Q_i, i)
31.    while δ_subset ≠ ∅ do
32.      (p, X, q) := Remove(δ_subset)
33.      if {A → αX•β} = ψ(p) such that β ⇒* ε then
34.        for r ∈ ε-succ(V-succ(q, α^R)) do    // Let goto(ψ(r), A) = I_j.
35.          if q_{j:i} ∉ Q then
36.            Q := Q ∪ {q_{j:i}}
37.            Traverse({q_{j:i}}, i)
38.          fi
39.          if (q_{j:i}, A, r) ∉ δ then
40.            δ := δ ∪ {(q_{j:i}, A, r)}
41.            δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
42.          fi
43.        od
44.      fi
45.    od
46.  end

47.  function Traverse(Q_subset, i)
48.    while Q_subset ≠ ∅ do
49.      q := Remove(Q_subset)
50.      for goto(ψ(q), X) = I_j such that X ⇒* ε do    // X ∈ N ∪ {ε}
51.        if q_{j:i} ∉ Q then
52.          Q := Q ∪ {q_{j:i}}
53.          Q_subset := Q_subset ∪ {q_{j:i}}
54.        fi
55.        δ := δ ∪ {(q_{j:i}, X, q)}    // Never redundant.
56.      od
57.    od
58.  end

Figure 6.2 - continued


Relationship  to  Earley’s  Algorithm 

A connection between Earley’s algorithm and General_NLR0 is established. The link
between these two algorithms is made indirectly through Earley'. Specifically, we describe a
correspondence between the Earley state graph constructed by Earley' and the recognition
graph constructed by General_NLR0.

Let $G_1 = (Q_1, \Sigma, \delta_1)$ and $G_2 = (Q_2, \Sigma, \delta_2)$ be state-transition graphs. Graph $G_1$ is
homomorphic (resp. isomorphic) to graph $G_2$ if there exists a surjection (resp. bijection)
$f: Q_1 \to Q_2$ which induces a surjection (resp. bijection) $g: \delta_1 \to \delta_2$ defined by

$g((p, a, q)) = (f(p), a, f(q))$, $p, q \in Q_1$, $a \in \Sigma \cup \{\varepsilon\}$.

Let $M_{NC}(G) = (I, V, \mathit{goto}, I_0, I)$ with $I = \{I_0, I_1, \ldots, I_{m-1}\}$ be the NLR(0) automaton
of G. Let $G_{E'} = (Q_{E'}, V, \delta_{E'})$ be the Earley state graph constructed by Earley' when it is
applied to G and w. Lastly, let $G_R(M_{NC}) = (Q, V, \delta)$ be the recognition graph constructed
by General_NLR0 when it is applied to G and w. Graph $G_{E'}$ is homomorphic to $G_R^{-1}$ as
follows. The function $f_1: Q_{E'} \to Q$ defined by $f_1([A \to \alpha \bullet \beta, j] \in S_i) = q_{k:i}$ where
$I_k = \{A \to \alpha \bullet \beta\}$ is a surjection which induces the surjection $g_1: \delta_{E'} \to \delta^{-1}$ defined by

$g_1((r, X, s)) = (f_1(s), X, f_1(r))$, $r, s \in Q_{E'}$, $X \in V \cup \{\varepsilon\}$.

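If an STG $G_1$ is homomorphic to an STG $G_2$, then an STG $\hat{G}_1$ can be derived from $G_1$
such that $G_1$ is homomorphic to $\hat{G}_1$ and $\hat{G}_1$ is isomorphic to $G_2$. Our comparison of Earley'
and General_NLR0 is concluded by defining an STG $\hat{G}_{E'} = (\hat{Q}_{E'}, V, \hat{\delta}_{E'})$ such that $G_{E'}$ is
homomorphic to $\hat{G}_{E'}$ and $\hat{G}_{E'}$ is isomorphic to $G_R^{-1}$.

For $0 \le k \le m-1$ and $0 \le i \le n+1$, define $s_{k:i}$ by
$s_{k:i} = \{[A \to \alpha \bullet \beta, j] \in S_i \mid I_k = \{A \to \alpha \bullet \beta\},\ 0 \le j \le i\}$. The states of $\hat{G}_{E'}$ are
defined by $\hat{Q}_{E'} = \{s_{k:i} \mid 0 \le k \le m-1,\ 0 \le i \le n+1,\ s_{k:i} \ne \emptyset\}$. Thus, $\hat{Q}_{E'}$ defines a
partition of $Q_{E'}$. The transitions of $\hat{G}_{E'}$ are defined as follows. For $\hat{r}, \hat{s} \in \hat{Q}_{E'}$ and
$X \in V \cup \{\varepsilon\}$, $(\hat{r}, X, \hat{s}) \in \hat{\delta}_{E'}$ if and only if there exist $r, s \in Q_{E'}$ such that $r \in \hat{r}$,
$s \in \hat{s}$, and $(r, X, s) \in \delta_{E'}$. By construction, $G_{E'}$ is homomorphic to $\hat{G}_{E'}$.

That $\hat{G}_{E'}$ is isomorphic to $G_R^{-1}$ is established as follows. Define the function
$f_2: \hat{Q}_{E'} \to Q$ by $f_2(s_{k:i}) = q_{k:i}$. The function $f_2$ is a bijection which induces the bijection
$g_2: \hat{\delta}_{E'} \to \delta^{-1}$ defined by $g_2((\hat{r}, X, \hat{s})) = (f_2(\hat{s}), X, f_2(\hat{r}))$, $\hat{r}, \hat{s} \in \hat{Q}_{E'}$,
$X \in V \cup \{\varepsilon\}$. Therefore, $\hat{G}_{E'}$ is isomorphic to $G_R^{-1}$.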

Implementation  Considerations 


For the remainder of this chapter, we turn our attention back to the General_LR0
recognizer. In this section, some issues that are pertinent to implementing General_LR0 are
addressed. Specifically, means for properly handling graph cycles and for efficiently
implementing the relevant set operations and the succ function are discussed. A satisfactory
resolution of these issues facilitates the complexity analyses undertaken in the next section.

Graph  Cycles 

In any application which involves graphs that are not necessarily acyclic, graph cycles
are a matter of concern. Neither LR(0) automata nor the recognition graphs constructed by
General_LR0 are guaranteed to be acyclic.

Let $M_C(G)$ denote the LR(0) automaton of G and let $G_R(M_C)$ denote the recognition
graph constructed by General_LR0 when it is applied to G and w. Since all paths in $G_R$ are
reflected in $M_C$, albeit in reverse, $G_R$ is cyclic only if $M_C$ is also cyclic. However, the
converse does not hold; $M_C$ may have cycles that are not replicated in a recognition graph
regardless of the input string.

Properties of context-free grammars that give rise to cycles of any kind in LR(0)
automata are identified first. Since $L(M_C) = VP(G)$, $M_C$ is cyclic if and only if VP(G) contains
strings of unbounded length. Thus, $M_C$ is cyclic if and only if for some $A \in N$, $\alpha \in V^*$ with
$\alpha \ne \varepsilon$, and $y \in T^*$, $A \Rightarrow^{+}_{rm} \alpha A y$ holds in G. That is, $\forall i \ge 0$, $\delta\alpha^i \in VP(G)$ for some $\delta \in V^*$.
Note that $\alpha$ may contain terminal symbols.

Grammatical properties which give rise to those cycles in $M_C$ that can also be
reproduced in $G_R$ are considered next. Since the above conditions characterize all possible cycles
in $M_C$, a restriction on those conditions is sought. Assume for the moment that $G_R$ is cyclic.
Given an arbitrary transition in $G_R$ of the form $(q_{k:i}, X, q_{j:h})$, we know that $h \le i$ holds.
Thus, a particular cycle in $G_R$ must consist solely of states in $Q_i$ for some i, $0 \le i \le n$.
Moreover, every transition between any two states in $Q_i$ is on some nullable nonterminal
symbol. Consequently, the conditions given above are modified as follows. A control
automaton $M_C$ has a cycle which may be reproduced in $G_R$ if and only if for some $A \in N$, $\alpha \in V^*$
with $\alpha \ne \varepsilon$, and $y \in T^*$, $A \Rightarrow^{+}_{rm} \alpha A y$ and $\alpha \Rightarrow^* \varepsilon$ hold in G. Of course, whether or not a cycle
is actually introduced into a recognition graph depends on the input string as well as the
subject grammar.
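
The condition $\alpha \Rightarrow^* \varepsilon$ can only hold when every symbol of $\alpha$ is a nullable nonterminal.
As a small aid, the set of nullable nonterminals can be computed by the usual fixed-point
iteration. The Python sketch below is illustrative; the grammar encoding and the function
name are assumptions and do not come from the text.

```python
def nullable_nonterminals(productions):
    """Return the set of nonterminals A such that A derives the empty string.

    productions: list of (lhs, rhs) pairs, where rhs is a tuple of symbols.
    """
    nullable = set()
    changed = True
    while changed:                    # iterate until no new nullable symbol is found
        changed = False
        for lhs, rhs in productions:
            # lhs is nullable if every symbol on the right-hand side is nullable
            # (vacuously true for an empty right-hand side)
            if lhs not in nullable and all(X in nullable for X in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

# Example: S -> A B,  A -> (empty),  B -> A,  C -> c
example = [("S", ("A", "B")), ("A", ()), ("B", ("A",)), ("C", ("c",))]
result = nullable_nonterminals(example)
```

A string $\alpha$ then satisfies $\alpha \Rightarrow^* \varepsilon$ exactly when all of its symbols belong to the computed set.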

A result by Soisalon-Soininen and Tarhio [40] relating to the concept of a looping LR
parser was helpful in identifying the grammatical properties that give rise to cyclic
recognition graphs. Looping LR parsers are discussed in conjunction with a method for constructing
deterministic LR parsers for some non-LR(k) grammars [2]; this method involves
disambiguating multiply-defined parse table entries. A looping LR parser is an LR parser that has a
parsing configuration such that all subsequent actions are reductions. The non-LR(k)
grammars for which looping LR parsers can be produced (i.e., for some set of disambiguation
choices) can be characterized as follows.

Fact 6.1 A looping LR parser can be constructed for G if and only if for some $A \in N$
and $\alpha, \beta \in V^*$ the following three statements hold in G: (1) $A \Rightarrow^+ \alpha A \beta$, (2) $\alpha \Rightarrow^* \varepsilon$, and
(3) if $\alpha = \varepsilon$, then $\beta \Rightarrow^* \varepsilon$.

Proof. This is the main result presented by Soisalon-Soininen and Tarhio [40]. □

In summary, a cycle in $M_C$ is introduced into $G_R$ only if it spells a nontrivial string of
nullable nonterminal symbols. Paths spelling strings of nullable nonterminals which can
cause cycles are introduced into $G_R$ by the Traverse function. This is effectively carried out
through a traversal of $M_C$ where each state in $M_C$ is considered at most once. Once cycles
are present in $G_R$, they are traversed, if at all, in the Reduce function. Specifically, the
computation of the succ function implies a traversal of certain paths in $G_R$, including those
which contain cycles. An implementation of the succ function which properly deals with
cycles in $G_R$ is described in a later subsection. In either case, cyclic control automata and
recognition graphs do not pose any particular difficulty to General_LR0.

Set  Operations 

Two sets are maintained by General_LR0 during recognition, viz., Q and $\delta$. Two set
operations are used in the process. One operation is that of determining if a particular
object is an element of a set. The other operation is that of adding an object to a set.
Efficient means for implementing these operations with respect to both Q and $\delta$ are
described below.

The operations on Q are considered first. We assume that the states in $Q_i$ are stored
on a separate linked list for each value of i. Thus, whether or not $q_{j:i}$ exists in Q can be
determined by scanning a list of at most m items. A state is added to Q by simply linking it
into the appropriate list. Since m depends only on the grammar and not on the input, both
set operations of interest can be performed with respect to Q in constant time.

Membership in Q can be resolved faster using the following scheme. A boolean flag is
associated with each state in $M_C$. The flags are reset to false at the beginning of each
iteration of the main for loop in General_LR0. When a state q is added to Q by either Reduce
or Shift in the ith iteration, $0 \le i \le n$, the flag associated with $\psi(q)$ is set to true. In this
way, the membership of $q \in Q$ can be determined during the ith iteration by testing the flag
associated with $\psi(q)$.

The overhead associated with resetting m boolean flags each time through the loop can
be avoided by using integer flags instead. The flags are initialized to −1. When a state q is
added to Q in the ith iteration, $0 \le i \le n$, the flag associated with $\psi(q)$ is set equal to i. The
membership of q in Q is resolved during the ith iteration by comparing i with the value of
the flag for $\psi(q)$. If the flag’s value is less than i, then $q \notin Q$. Otherwise, the flag’s value is
equal to i and $q \in Q$.
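
The integer-flag scheme can be sketched as follows. This is a minimal Python illustration
under the assumption that control-automaton states are numbered 0 to m−1; the class and
method names are hypothetical.

```python
class StateSetFlags:
    """Membership test for the current state subset, one integer stamp per control state."""

    def __init__(self, m):
        # One flag per state of the control automaton, initialized to -1.
        self.stamp = [-1] * m

    def add(self, k, i):
        """Record that the graph state q_{k:i} was added during iteration i."""
        self.stamp[k] = i

    def contains(self, k, i):
        """Return True iff q_{k:i} has already been added during iteration i."""
        return self.stamp[k] == i

flags = StateSetFlags(3)
before = flags.contains(1, 0)      # nothing has been added yet
flags.add(1, 0)
after = flags.contains(1, 0)       # q_{1:0} is now present
next_iter = flags.contains(1, 1)   # a stale stamp from iteration 0 reads as absent
```

No reset pass is needed between iterations because a stamp left over from an earlier
iteration is simply smaller than the current value of i.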

Managing the transition set $\delta$ is slightly more involved. We assume that all of the
transitions out of $q_{j:i}$, with $0 \le j \le m-1$ and $0 \le i \le n+1$, are stored on a linked list attached to
$q_{j:i}$. Thus, a new transition out of $q_{j:i}$ can simply be linked into this list. However, this list
may contain O(i+1) items, so it can be costly to scan the list in search of a transition. An
efficient method for resolving membership with respect to $\delta$ is described as follows. Note
that we need only be concerned with transitions on nonterminals since transitions on
terminals are never generated redundantly. Thus, we assume that entry($I_j$) = A for some $A \in N$.
Let $(q_{j:i}, A, q_{k:h})$ be a transition to be added to $\delta$. We assume that when $q_{k:h}$ was
created an integer flag was attached to it for each nonterminal transition out of $I_k$ in $M_C$;
these flags are initialized to −1. If $(q_{j:i}, A, q_{k:h}) \notin \delta$, then the flag attached to $q_{k:h}$ that is
associated with the transition out of $I_k$ on A is less than i. When the transition is added to
$\delta$, this flag is set equal to i. The effectiveness of this scheme is a consequence of the order in
which transitions are added to $\delta$. Thus, both set operations can be performed with respect to
$\delta$ in constant time as well.
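
A possible rendering of the per-transition flags is sketched below in Python. The sketch
relies on the control automaton being deterministic, so that the target state, the
nonterminal A, and the iteration i together identify the transition; the class and function
names are hypothetical.

```python
class GraphState:
    """A recognition-graph state q_{k:h} with one integer flag per nonterminal edge of I_k."""

    def __init__(self, k, h, nonterminal_edges):
        self.k, self.h = k, h
        # Flag for each nonterminal transition out of I_k, initialized to -1.
        self.last_added = {A: -1 for A in nonterminal_edges}

def add_transition(delta, source, A, target, i):
    """Add (source, A, target) during iteration i unless it is already present.

    Returns True if the transition was new. The test costs constant time because
    only the flag stored on the target state is inspected.
    """
    if target.last_added[A] == i:
        return False                  # already added earlier in this iteration
    target.last_added[A] = i
    delta.append((source, A, target))
    return True

delta = []
t = GraphState(2, 0, ["A", "B"])
first = add_transition(delta, "q5:3", "A", t, 3)    # new transition
second = add_transition(delta, "q5:3", "A", t, 3)   # duplicate, rejected
third = add_transition(delta, "q5:4", "A", t, 4)    # later iteration, new again
```

The scheme depends on transitions being added in nondecreasing order of i, as the text notes.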

The  succ  Function 

The last significant aspect of General_LR0 that needs explication is its use of the succ
function. This subsection proposes one approach to implementing succ. A revised Reduce
function is presented which incorporates the method. The modified function is displayed in
Figure 6.3.

Each use of succ in Reduce implies that a traversal of $G_R$ is carried out. An auxiliary
stack, the Succ_Stack, is used by Reduce to effect this traversal. Each entry in the stack
records an intermediate stage in the traversal of $G_R$ that is required to compute the succ
function.

Consider the reference to the succ function in line 34 of Figure 6.1. Based on properties
of control automata and recognition graphs, the following holds: $\mathrm{succ}(q, \alpha^R)$ =
$\{r \in Q \mid \exists$ a path in $G_R$ from q to r spelling $\alpha^R\}$ = $\{r \in Q \mid \exists$ a path in $G_R$ from q to r of
length len($\alpha$)$\}$. Motivated by this observation, each entry in Succ_Stack is a triple (r', A, d) where
(1) r' is a state in $G_R$ to which some path traversal from q has progressed, (2) A is the
left-hand side of the production being reduced, and (3) d is the distance left to go before a state
in $\mathrm{succ}(q, \alpha^R)$ is reached where $d \le \mathrm{len}(\alpha)$. The discussion of the modified Reduce function
that follows clarifies how Succ_Stack is used to compute the succ function.

(1-3)  These  three  lines  correspond  to  lines  28-30  of  Figure  6.1. 

(4) The Succ_Stack is initially empty.

 1.  function Reduce(i)    // Revised to implement the succ function.
 2.    δ_subset := δ_i
 3.    Traverse(Q_i, i)
 4.    Succ_Stack := ∅
 5.    while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
 6.      if Succ_Stack = ∅ then
 7.        (p, X, q) := Remove(δ_subset)
 8.        for A → αX•β ∈ ψ(p) such that β ⇒* ε do
 9.          Push(Succ_Stack, (q, A, len(α)))
10.        od
11.      else    // Succ_Stack ≠ ∅
12.        (r, A, d) := Pop(Succ_Stack)
13.        if d > 0 then    // Let entry(ψ(r)) = X.
14.          for r' ∈ Q such that (r, X, r') ∈ δ do
15.            Push(Succ_Stack, (r', A, d−1))
16.          od
17.        else    // d = 0, let goto(ψ(r), A) = I_j.
18.          if q_{j:i} ∉ Q then
19.            Q := Q ∪ {q_{j:i}}
20.            Traverse({q_{j:i}}, i)
21.          fi
22.          if (q_{j:i}, A, r) ∉ δ then
23.            δ := δ ∪ {(q_{j:i}, A, r)}
24.            δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
25.          fi
26.        fi
27.      fi
28.    od
29.  end

Figure 6.3 - A Modified Reduce Function


(5) This while loop corresponds to the while loop at line 31 in Figure 6.1. However,
in this case there are two collections to exhaust before the loop terminates.

(6) The true branch of the if statement deals with items in δ_subset and the false
branch deals with items in Succ_Stack. The if predicate is written so that items in
Succ_Stack have priority over items in δ_subset. Clearly the predicate is true in the first
iteration of the while loop, since Succ_Stack is initially empty.

(7-8) These two lines are the same as lines 32-33 of Figure 6.1.

(9) Instead of invoking the succ function as in line 34 of Figure 6.1, we initiate the
graph traversal of $G_R$ that is implied by that use of succ. Specifically, (q, A, len($\alpha$)) is
pushed onto Succ_Stack to record that we want to find the successors of q which are located
at the ends of paths of length len($\alpha$) from q; moreover, when each of these states is found, a
transition on A will be made to it from an appropriate state in $Q_i$.

(11) The Succ_Stack is not empty, so one of its entries is processed.

(12) An item (r, A, d) is removed from Succ_Stack.

(13-16) If d > 0, then the stage in the traversal of $G_R$ that is recorded by (r, A, d) has
not progressed far enough. Let entry($\psi(r)$) = X. Then every transition out of r is on X.
For each state $r' \in Q$ such that (r, X, r') is a transition in $G_R$, (r', A, d−1) is pushed onto
Succ_Stack. By effectively moving to r', the length of the traversal has been increased by 1.
Consequently, the distance remaining is decreased by 1.

(17-25) If d = 0, then $r \in \mathrm{succ}(q, \alpha^R)$ for some q and $\alpha$ referred to in lines 7-9. Lines
18-25 are identical to lines 35-42 of Figure 6.1.
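
The core of this traversal, stripped of the reduction bookkeeping, is a search for all states
at a fixed distance from q. The Python sketch below isolates that step. It is a simplified
illustration: the A component of each stack entry is omitted, transition labels are ignored
(relying on the observation above that path length alone determines the result), and the
function name is hypothetical.

```python
def succ_by_length(delta, q, length):
    """Return the states at the ends of paths of exactly `length` transitions from q.

    delta: iterable of (source, label, target) triples.
    Cycles are harmless: every stack entry carries a remaining distance that
    strictly decreases, so the loop always terminates.
    """
    out = {}
    for (src, _label, dst) in delta:
        out.setdefault(src, []).append(dst)
    found = set()
    stack = [(q, length)]             # entries are (state reached, distance left)
    while stack:
        r, d = stack.pop()
        if d > 0:
            for r2 in out.get(r, []):
                stack.append((r2, d - 1))
        else:
            found.add(r)              # distance exhausted: r is a successor
    return found

# Example with a self-loop on state 1: 0 -X-> 1, 1 -Y-> 1, 1 -Y-> 2
delta = [(0, "X", 1), (1, "Y", 1), (1, "Y", 2)]
```

In the example, the self-loop on state 1 is followed at most as many times as the requested
length allows, which is how the stack-based formulation copes with cyclic recognition graphs.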

The  Complexity  of  Recognition 

In this section, some worst-case complexity bounds are established for the General_LR0
recognizer. Specifically, we consider the amount of space and time required by General_LR0,
in the worst case, when it is applied to G and w. In the following, it is convenient to assume
that $w \in L(G)$. In addition, the LR(0) automaton of G, $M_C(G)$, is assumed to have m
states.

Bounds on space requirements are derived first. They are useful in determining the
time bounds. In both cases, bounds are established for arbitrary G and for arbitrary
unambiguous G.

Space  Bounds 

The space complexity of General_LR0 is determined by placing an upper bound on the
number of states and transitions in $G_R$ at the point when w is accepted. The sizes of the
auxiliary data structures, i.e., Q_subset, δ_subset, and Succ_Stack, are accounted for later.

First, we assume that G is arbitrary. For $0 \le i \le n$, $|Q_i| \le m$, and $Q_{n+1}$ contains
one state. Thus, there are at most $m(n+1)+1 \in O(n)$ states in $G_R$.

Consider $\delta_i$ for some i, $0 \le i \le n$. In the worst case, every state in $Q_i$ has a transition
to every state in $\bigcup_{0 \le j \le i} Q_j$. The number of states in $\bigcup_{0 \le j \le i} Q_j$ is at most $m(i+1)$.
Consequently, since $Q_i$ has at most m states, $\delta_i$ has at most $m^2(i+1)$ transitions. In addition,
$\delta_{n+1}$ contains one transition. Thus, there are at most $1 + \sum_{i=0}^{n} m^2(i+1) \in O(n^2)$
transitions in $G_R$.
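
For reference, the bound on the number of transitions can be written in closed form. This is
a routine arithmetic-series evaluation, added here for illustration, with m treated as a
constant because it depends only on G:

```latex
\[
  1 + \sum_{i=0}^{n} m^2 (i+1)
  \;=\; 1 + m^2 \,\frac{(n+1)(n+2)}{2}
  \;\in\; O(n^2).
\]
```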

Summarizing, $G_R$ contains at most O(n) states and $O(n^2)$ transitions. Therefore, the
space complexity of General_LR0 for arbitrary G is $O(n^2)$. An ambiguous grammar that
meets this worst-case space bound is the following: $\{S \to S\,S \mid a \mid \varepsilon\}$.

The space complexity of General_LR0 remains $O(n^2)$ even if G is unambiguous. For
example, the unambiguous grammar with production set $\{S \to a\,S\,a \mid a \mid \varepsilon\}$ meets this
worst-case space bound.

Time  Bounds 

The time complexity of General_LR0 is determined by placing an upper bound on the
time required to construct $G_R$. It transpires that the complexity of General_LR0 is
dominated by the complexity of the Reduce function. The following remarks are made in light of
the earlier observations regarding the efficiency of the set operations used by General_LR0.

The main function invokes the Shift and Reduce functions n+1 times each. Thus, the
time complexity of General_LR0 is determined from the time spent in these two functions
throughout the duration of recognition.

At most m states and m transitions are installed in $G_R$ during any one invocation of
the Shift function. Thus, over n+1 calls, O(n) time is spent within Shift.

In analyzing the complexity of the Reduce function, the time spent within Traverse is
accounted for separately. In any one invocation of Reduce, the Traverse function is called at
most m times. That is, in the worst case it is called once for each state in $Q_i$. Within any
one invocation of Traverse, at most m states and $m^2$ transitions are added to the recognition
graph. Thus, over n+1 calls to Reduce, O(n) time is spent within the Traverse function.

In assessing the contribution of the Reduce function to the time complexity of
General_LR0, we first assume that G is unambiguous. For some i, $1 \le i \le n$, the ith
invocation of Reduce is analyzed.³ By an inspection of the while loop, the time spent within
Reduce is based on the number of items that are cycled through δ_subset and Succ_Stack.
From the analysis of the space complexity of General_LR0, there are at most $m^2 i$ transitions
from states in $Q_i$ to states in $\bigcup_{0 \le j < i} Q_j$ at the completion of the ith call to Reduce. These
are precisely the transitions that are cycled through δ_subset. Although at least one of these
transitions must have been generated in the most recent invocation of Shift, for simplicity we
assume that all O(i) of them are created by Reduce. Under this assumption, each transition
in $\delta_i$ results from traversing some path in $G_R$ that spells the reversal of some prefix of a
production right-hand side. This traversal is effected through the use of the Succ_Stack. Let
$p = \max(\{\mathrm{len}(\omega) \mid A \to \omega \in P\})$. Thus, at most $m^2 i p$ entries are cycled through Succ_Stack
while all of the reductions relevant to the ith call to Reduce are performed. Together, at
most $m^2 i(p+1)$ items are cycled through δ_subset and Succ_Stack. Since
$\sum_{i=1}^{n} m^2 i(p+1) \in O(n^2)$, the total time spent in Reduce over n+1 calls is $O(n^2)$. Accumulating
the total time consumed by Shift, Traverse, and Reduce, we conclude that General_LR0
runs in $O(n^2)$ time in the worst case if G is unambiguous.

³ All of the work is done by Traverse when i = 0 since $\delta_0 = \emptyset$ when Reduce is called in that
instance.

Now assume that G is arbitrary. Again, we want to determine the total number of
items cycled through δ_subset and Succ_Stack during the ith call to Reduce for some i,
$1 \le i \le n$. The number of transitions cycled through δ_subset is still bounded by $m^2 i$. A
bound on the number of entries cycled through Succ_Stack is given by the number of distinct
paths that may be traversed when making all possible reductions back through those
transitions. Consider one of the O(i) transitions in $\delta_i$, say (p, X, q). Suppose that
$A \to \alpha X \bullet \beta \in \psi(p)$. Further suppose that len($\alpha X$) = p, where p on the right-hand side is the
maximum right-hand side length defined above and not the state p. While traversing all of
the paths in $G_R$ that emanate from the state p, pass through (p, X, q), and spell $X\alpha^R$, the
number of items that are cycled through Succ_Stack is in $O(i^{p-1})$. Since there are
O(i) transitions in $\delta_i$ that may be reduced back through, $O(i^p)$ entries may be cycled
through Succ_Stack during the ith call to Reduce. Since $\sum_{i=1}^{n} i^p \in O(n^{p+1})$, General_LR0 runs
in $O(n^{p+1})$ time in the worst case.
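
The last step needs only a crude bound on the sum, spelled out here for convenience:

```latex
\[
  \sum_{i=1}^{n} i^{p} \;\le\; \sum_{i=1}^{n} n^{p} \;=\; n \cdot n^{p} \;=\; n^{p+1}.
\]
```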

The worst-case running time of General_LR0 does not compare favorably with Earley’s
recognizer. However, the parsing version of General_LR0 also runs in $O(n^{p+1})$ time in the
worst case. As shown in the next chapter, this bound more properly reflects the time required to
construct a convenient representation of all the possible parses of an input string. In
contrast, the $O(n^3)$ bound for Earley’s recognizer does not take into account the time required
by Earley’s algorithm to analyze its more indirectly represented parse forest.

We have not yet accounted for the maximum sizes potentially attained by the auxiliary
data structures Q_subset, δ_subset, and Succ_Stack. The set variable Q_subset holds at
most m states in either Shift or Traverse. In Reduce, the set variable δ_subset contains at
most $m^2(i+1)$ transitions. Since access to Succ_Stack follows a LIFO discipline, it contains
at most O(i) entries at any time. Therefore, the space required for these structures does not
contradict the worst-case space bounds for General_LR0 that were derived above.

On  Garbage  Collection  and  Lookahead 

Garbage collection and lookahead provide means for improving the efficiency of the
General_LR0 recognizer. Garbage collection is relevant to reclaiming the space occupied by
states and transitions in $G_R$ when they become superfluous to the remainder of the
recognition task. Lookahead is used for selectively generating only those states and transitions that
are consistent with the current lookahead string. Some basic notions regarding the use of
garbage collection and lookahead within General_LR0 are discussed briefly.

Recalling the set-theoretic foundation of General_LR0 helps to motivate the utility of
garbage collection. Since $G_R$ represents the sets of viable prefixes that are tracked by the
recognizer, the notion of a dead state as it applies to $M_R$ identifies nonessential states of the
recognition graph. Whether $G_R$ is considered at line 10 or line 12 of General_LR0, all states
that are dead with respect to $M_R$ at those points, as well as all transitions emanating from
them, are no longer needed. Consequently, the space used by these states and transitions can
be reclaimed for later use.

In order to determine an appropriate location within General_LR0 to invoke garbage
collection, note that if $M_R$ contains no dead states before Reduce is called, then it has no
dead states when Reduce terminates. However, the same remark does not apply to the Shift
function. In particular, states can become dead during the ith call to Shift, where $0 \le i \le n$, if
only a proper subset of the states in $Q_i$ have transitions generated to them. Thus, it is convenient
to perform garbage collection in conjunction with the Shift function by anticipating the
states that become dead as a result of it.

An appropriate place to perform garbage collection is immediately following line 19 in
the Shift function. The following simple scheme is sufficient.

(1) Mark all states that are reached in a traversal of $G_R$ that begins at the states in
Q_subset.

(2) In a second traversal that starts from the states in $Q_i \setminus$ Q_subset, delete from Q
the states that were not marked in step (1) and delete from $\delta$ the transitions that
emanate from those states.

Note that a garbage collection scheme based on reference counts would be far less
straightforward due to the self-references which arise from cycles in the recognition graph.
Moreover, the simple mark-and-sweep garbage collection procedure outlined above applies readily
to General_NLR0 as well.
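
The two steps can be sketched in Python as follows. The sketch makes one simplification
that is not in the text: instead of a second traversal starting from the unshifted states, the
sweep scans all of Q and $\delta$. The function name and the graph encoding are assumptions.

```python
def collect_garbage(Q, delta, live_roots):
    """Keep only the states reachable from live_roots, plus their out-going transitions.

    Q: set of states; delta: set of (source, label, target) triples;
    live_roots: the states from which the marking traversal starts.
    """
    out = {}
    for (src, _label, dst) in delta:
        out.setdefault(src, []).append(dst)

    # Step (1): mark every state reachable from the roots.
    marked = set()
    stack = list(live_roots)
    while stack:
        s = stack.pop()
        if s in marked:
            continue                  # already visited; this also makes cycles safe
        marked.add(s)
        stack.extend(out.get(s, []))

    # Step (2): sweep unmarked states and the transitions that emanate from them.
    Q_kept = {s for s in Q if s in marked}
    delta_kept = {(s, x, t) for (s, x, t) in delta if s in marked}
    return Q_kept, delta_kept

# Example: state 3 is dead and carries a self-loop; state 1 is live and has one too.
Q = {0, 1, 2, 3}
delta = {(2, "a", 0), (2, "B", 1), (1, "C", 1), (3, "b", 0), (3, "A", 3)}
Q_kept, delta_kept = collect_garbage(Q, delta, [2])
```

The self-loop on the dead state 3 is reclaimed together with the state, which is the case
that would defeat a scheme based on reference counts.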

Although garbage collection can improve the space efficiency of General_LR0, it
obviously incurs a time penalty. For $0 \le i \le n$, there are O(i+1) states and $O((i+1)^2)$
transitions in $G_R$ prior to the ith call of Shift. Thus, the procedure outlined above may be
performed in $O((i+1)^2)$ time. Observe that this is no worse than the worst-case time
complexity of the Reduce function.

In practice, one would probably want to perform garbage collection less often than on
every input symbol. Regardless, a similar procedure involving two graph traversals would
still apply. The first traversal begins from certain states in the most recently completed
state subset Q_i and marks all states reached in the process. In the second traversal, all
unmarked states and their outgoing transitions are deleted from the recognition graph.

The basic goal of garbage collection is to contract periodically the size of the recognition
graph. As a consequence, space taken up by nonessential states and transitions becomes
eligible for reuse. In contrast, the aim of lookahead is to anticipate the states and transitions
that are necessary to recognize the input string. In short, lookahead is used within Shift,
Reduce, and Traverse to selectively generate those states and transitions that are consistent
with the current lookahead string.

In order to make use of lookahead, the items in the control automaton are attributed
with appropriate lookahead strings. The literature on the computation and use of lookahead
in the context of LR parsers is quite extensive. The type of lookahead typically used in
conjunction with LR(0) automata is either SLR(k) lookahead [12] or LALR(k) lookahead
[8, 11, 29].⁴ Without going into detail, the use of k-symbol lookahead in General_LR0⁵ for
some k > 0 impacts the following locations in Figure 6.1.

(Line 19) Q_subset is computed to contain only those states q ∈ Q_i such that the shift
on a_{i+1} from ψ(q) is consistent with the lookahead string.

(33) Only those reductions are initiated from p that are consistent with the current
k-symbol lookahead. This comment also applies to line 8 in Figure 6.3.

(50) Transitions on nullable nonterminal symbols are selectively made based on their
consistency with the k-symbol lookahead string.

⁴ Almost invariably, k = 1.

⁵ This is somewhat of a misnomer when lookahead is employed.

The costs of employing lookahead include the space that is needed for storing lookahead
strings in the control automaton and the time associated with matching the k-symbol
lookahead string from the input string with occurrences of it in the control automaton. If k = 1,
as is generally the case, the overhead of using lookahead is not usually an issue.

On the other hand, the benefits of using lookahead can be substantial. Space is saved
by reducing the number of states and transitions that are needlessly created. In addition,
time is saved that would otherwise be spent generating unnecessary pieces of the recognition
graph and traversing paths that would be called for by Reduce in the absence of lookahead.
Most significantly, General_LR0 runs in linear space and time if G is an LR(k) grammar,
provided that k-symbol lookahead is used.

Discussion 

The Earley' and General_LR0 recognizers both construct state-transition graphs. In
each case, the STG is used for representing the sets of viable prefixes that are tracked by the
General_LR recognition scheme. The graph constructed by Earley', G_E', is derived
interpretively in the sense that the Earley states that are generated during recognition drive the
construction of the graph. In contrast, G_R is constructed under the guidance of a precomputed
control automaton. This distinction is obscured somewhat by the General_NLR0 recognizer.
General_NLR0 constructs a state-transition graph that is quite similar to G_E', but does so
under the guidance of the NLR(0) automaton of G.

The General_LR0 and General_NLR0 recognizers illustrate extremal examples of a
basic approach to general recognition that entails constructing a recognition graph under the
guidance of a controlling automaton. In each case, (1) the structure of the recognition graph
is mirrored in the control automaton, (2) the recognition graph is used to represent the sets
of viable prefixes that are tracked by the General_LR recognition scheme, and (3) the control
automaton accepts the viable prefixes of G. Other possible control automata are suggested
by the fact that the LR(0) automaton of G can be obtained by applying the subset construction
algorithm to the NLR(0) automaton of G. Any automaton intermediate between the
NLR(0) and LR(0) automata that is built during subset construction provides a viable
candidate for a control automaton. One main advantage of LR(0) automata is their determinism,
whereas a favorable feature of NLR(0) automata is their comparatively smaller number of
states. Automata that are intermediate between these two extremes can be tailored to
balance both of these factors. The choice of possible control automata is broadened still further
when lookahead is introduced. An investigation of alternate control automata is left for
future work.

Of the known context-free recognition algorithms, General_LR0 is most like Tomita’s
algorithm without lookahead [42, 43]. In this form, Tomita’s algorithm interprets a parse
table derived from the LR(0) automaton of G and maintains a so-called graph-structured
stack that is similar in structure to our recognition graph. However, a transition of the form
(p, A, q) is represented by two edges of the form (p, r_A) and (r_A, q) where p, q correspond to
parse states and r_A is a symbol vertex. In effect, the symbol vertices play the role of our
transition labels. Due to the use of these symbol vertices, the correspondence between the
states and edges in the graph-structured stack and the states and transitions of the underlying
LR(0) automaton is not as precise as in General_LR0. In addition, the symbol vertices
needlessly increase the number of vertices and edges in the graph-structured stack, increase
the lengths of paths that are traversed during reductions by a factor of 2, and complicate the
operations which manage the stack.

Tomita’s algorithm cannot handle cyclic grammars [42]. However, it also fails to handle
some noncyclic grammars that contain ε-productions. In short, any grammar that may
introduce a cycle into the graph-structured stack is troublesome. These grammars are
exactly the grammars that can introduce cycles into our recognition graphs.

Tomita’s algorithm independently keeps track of edges that may need to be reduced
back through and states that have yet to be acted on (a state is acted on to determine what
parse moves are relevant to it). In contrast, other than the special attention given certain
nonterminal transitions, General_LR0 uniformly lets the transitions stored in δ_subset drive
the reduction process.

The special handling required of nullable nonterminals is common to all general
recognizers that allow ε-productions. The manner in which Tomita’s algorithm deals with
ε-productions is the cause of its limited coverage. For i = 0 to n+1, the states in
U_i = ∪_j U_{i,j} are generated by Tomita’s algorithm as follows (U_i corresponds to our Q_i).

(1) Let j = 0.

(2) If i = 0, then U_{0,0} contains only the start state; otherwise, U_{i,0} is comprised of the
states that resulted from shift moves on a_i from states in U_{i-1}.

(3) If all of the states in U_{i,j} have been considered, then all of the reductions have
been performed at stage i. The shift moves on a_{i+1} are performed next.

(4) Perform all pending reductions by non-ε-productions from states in U_{i,j}; any new
state that is created is placed in U_{i,j}.

(5) Perform all pending reductions by ε-productions from states in U_{i,j}; any new
state that is created is placed in U_{i,j+1}.

(6) Let j = j+1 and return to step (3).

Thus, reductions by ε-productions are delayed until there are no other reductions to be
made. As a consequence of this treatment of ε-productions, ψ: U_i → I is not necessarily
one-to-one, where I represents the states in the underlying LR(0) automaton. This is an
undesirable anomaly that further obfuscates the operation of the algorithm. By comparison,
General_LR0 ensures that ψ: Q_i → I is always one-to-one.

The fact that Tomita’s algorithm fails to handle some noncyclic grammars with
ε-productions was also observed by Nozohoor-Farshi [35]; in particular, the focus is on grammars
for which ∃A ∈ N such that A ⇒+ αAβ and α ⇒+ ε hold in G, but β ⇒* ε does not hold.
In order to accept grammars of this kind, a modification to Tomita’s algorithm is proposed
which allows cycles in the graph-structured stack. The basic approach to handling such
cycles is outlined as follows: when a nonterminal transition is installed from a state q ∈ U_i
that already existed in the graph, all states in U_i which were previously acted on are
reconsidered to see if any reductions from them pass through the new transition. This is
apparently sufficient, but the details of how it is accomplished are not provided.

The worst-case time complexity of Tomita’s algorithm is also O(n^{p+1}) [26]. In
comparison, recall that the complexity of Earley’s algorithm is not affected by the length of
production right-hand sides. Accompanying the complexity analysis by Kipps [26] is a modified
version of Tomita’s algorithm that has a worst-case running time in O(n^3). In short,
additional interstate links are used for decreasing the number of paths that must be traversed
when performing reductions. However, the plethora of set-union and set-membership
operations contained in the algorithm does not make it clear that O(n^3) time is obtained. In any
case, this modification subverts the algorithm’s ability to construct a parse forest, so it is
only useful for recognition.


CHAPTER VII

A GENERAL  BOTTOM-UP  PARSER 


The General_LR0 recognizer is extended into a general bottom-up parser in this
chapter. The transformation from general recognizer to general parser is straightforward in
all but one respect: some effort must be expended to parse arbitrary derivations of the
empty string. Briefly, a parse of an input string is represented by appropriately annotating
the transitions of the recognition graph. Ambiguity is accommodated by attaching multiple
annotations to relevant transitions. As usual, an arbitrary reduced $-augmented grammar
G = (V, T, P, S) and an arbitrary string w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n,
a_{n+1} = $, are assumed throughout.

From  Recognition  to  Parsing 

Implementations of deterministic bottom-up parsers, of which LR parsers are
exemplary, are not obliged to build an explicit parse tree for the input string. Whether or not a
parse tree is indeed constructed is primarily dictated by the requirements of the application
to which the parser is applied. Other factors which are influential include memory
constraints and the interface between the parser and other processing components.

In contrast, general bottom-up parsers typically cannot avoid explicit parse tree
representations. When parsing against a nondeterministic grammar, a forest of parse trees
rather than an identifiably unique tree is typically relevant to the input string. Due to
theoretical limitations on the discrimination afforded by lookahead, this behavior is even
observed with unambiguous grammars. In any case, some representation of the parse forest
must be built during parsing so that a unique parse can eventually be produced.

In light of these observations, the parsing version of General_LR0, General_LR0',
overtly maintains a representation of a parse forest. The manner in which this is
accomplished is a simple generalization of the following proposed scheme for explicitly constructing
a parse tree within an LR parser.

Suppose that G is an LR grammar. We consider a hypothetical LR parser for G and
describe one way to explicitly build a parse tree for an input string in conjunction with the
parse stack. We may assume that the parser is based on some LR automaton for G, say M.
At any point during a parse, the contents of the stack is a sequence of states from M. The
parse tree that is synthesized during parsing is represented by associating a node in the tree
with each state in the stack other than the bottom-most state.

Let the contents of the stack at some point be s_0 s_1 ... s_m, m ≥ 0, where each s_i is a
state of M; in particular, s_0 is the start state of M. For 1 ≤ i ≤ m, let X_i be the entry
symbol for state s_i. Thus, X_1 X_2 ... X_m is the viable prefix of G that is implicitly represented
by the supposed stack contents. If m = 0, the relevant viable prefix is ε. For 1 ≤ i ≤ m, we
assume that some representation of a parse tree node labeled with X_i is attached to the
entry for s_i in the stack. The shift and reduce actions of M generate additional tree nodes
as follows.

A shift action always creates a new leaf node. Suppose that the current input symbol is
a and the next action of the parser is to shift a from s_m. As a result of this action, the
contents of the stack becomes s_0 s_1 ... s_m t_1 where goto(s_m, a) = t_1. As a side effect, a new
parse tree node is generated, labeled with a, and attached to t_1 in the stack.

A reduce action typically generates one internal node. However, when reducing by an
ε-production, a leaf node is also created. Suppose that the next action called for by the
parser is to reduce by production A → ε. This action transforms the contents of the stack to
s_0 s_1 ... s_m t_2 where goto(s_m, A) = t_2. Two new tree nodes are generated as a side effect.
One tree node is a leaf that is labeled with ε. The second is an internal tree node; it is
labeled with A, set to point to the new leaf, and attached to t_2.

Lastly, suppose that the next action called for by the parser is to reduce by production
A → X_{m-r} ... X_{m-1} X_m, r ≥ 0, i.e., the length of the right-hand side is strictly greater than
0. If goto(s_{m-r-1}, A) = t_3, then the contents of the stack becomes s_0 s_1 ... s_{m-r-1} t_3 and a
new tree node labeled with A is attached to t_3. In addition, this new internal node is set to
point to each of the nodes that were associated with the states s_{m-r}, ..., s_{m-1}, s_m before
the reduction was made.

Upon accepting the input string w, the contents of the stack is s_0 s' s'' where goto(s_0, S)
= s' and goto(s', $) = s''. At this point, the root of the parse tree for a_1 a_2 ... a_n is
attached to s'.
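The scheme just described can be sketched in a few lines. The sketch below is illustrative only: the names Node, shift, reduce, and render, the hand-written GOTO table, and the toy grammar S → a A, A → ε are assumptions made for this example, and the parse actions are driven by hand rather than by a parse table.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

def shift(stack, goto, a):
    """Push goto(s_m, a) together with a new leaf labeled a."""
    state, _ = stack[-1]
    stack.append((goto[(state, a)], Node(a)))

def reduce(stack, goto, lhs, rhs_len):
    """Reduce by a production lhs -> X_1 ... X_k with k = rhs_len."""
    if rhs_len == 0:
        # Reduction by an epsilon-production: a leaf labeled ε is also created.
        children = [Node("ε")]
    else:
        # The new internal node points to the nodes of the popped states.
        children = [node for _, node in stack[-rhs_len:]]
        del stack[-rhs_len:]
    state, _ = stack[-1]
    stack.append((goto[(state, lhs)], Node(lhs, children)))

def render(node):
    """Bracketed rendering of a tree, for inspection only."""
    if not node.children:
        return node.label
    return node.label + "(" + " ".join(render(c) for c in node.children) + ")"

# Hypothetical goto function for the toy grammar S -> a A, A -> ε.
GOTO = {(0, "a"): 1, (1, "A"): 2, (0, "S"): 3, (3, "$"): 4}

stack = [(0, None)]          # the bottom-most state carries no node
shift(stack, GOTO, "a")
reduce(stack, GOTO, "A", 0)  # A -> ε
reduce(stack, GOTO, "S", 2)  # S -> a A
shift(stack, GOTO, "$")
root = stack[1][1]           # the node attached to s'
```

After the final shift the stack holds the three states s_0 s' s'', and the node attached to s' is the root of the tree for the input a.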

A parse forest for the input string is synthesized by General_LR0' in an analogous
fashion. Specifically, the Shift and Reduce functions are modified to annotate the recognition
graph with information sufficient for representing the parse forest. The parse annotations
are attached to the transitions of the recognition graph since the connectivity of the graph,
i.e., as exhibited through the transitions, reflects the structure of the parse forest.

Overlooking many of the details that are supplied later, General_LR0' constructs a
parse forest as follows. When a transition on a ∈ T is created by Shift, a leaf node labeled
with a is attached to that transition. A transition that is created by Reduce corresponds to
an internal node of the parse forest. The parse annotation attached to it includes pointers to
the parse annotations associated with the transitions that were traversed along the way
toward creating that transition (i.e., the transitions traversed in the computation of the succ
function). The transitions created by Traverse are annotated so as to avoid creating
circularities in the parse forest that arise due to unbounded derivations of the empty string. In
short, Traverse resolves all ambiguous derivations of ε.

A transition that is multiply-defined, i.e., due to ambiguity, can have a distinct parse
annotation attached to it for each path in the recognition graph that reduced to that
transition. In this way, the parse forest becomes a factored representation of all possible parse
trees for the input string (excluding ambiguous derivations of ε). However, the presentation
that follows is simplified by assuming that ambiguities are resolved as soon as they are
detected. Of course, the ease with which ambiguities can actually be resolved is dictated by
semantic properties of the language generated by G.

Parse  Annotations 

The parse forest built by General_LR0' is maintained through information that is
attached to the transitions of the recognition graph. These attachments have already been
referred to as parse annotations. The notation that is used for denoting parse annotations is
introduced next. For simplicity, only one parse annotation is ever attached to a given
transition.

The Greek letter π, possibly with a subscript, is used regularly to denote parse
annotations. All parse annotations are enclosed within square brackets. Thus, [π] is a simple
example of the notation used to denote a parse annotation.

The parse annotation for a transition on a ∈ T that is generated by Shift is denoted by
[a]. Conceptually, this annotation is some descriptor for the terminal symbol a. A
transition on A ∈ N that is generated by Traverse as the result of a reduction by A → ε is
annotated with [ε], i.e., a suitable descriptor for the empty string. The notion of an empty parse
annotation, denoted by [], is also useful; note that this annotation is distinct from [ε].

The parse annotation of every other nonterminal transition, whether generated by
Reduce or Traverse, consists of a list of pointers to other parse annotations. For this
purpose, we let &π denote a pointer or reference to the parse annotation [π] (or equivalently, a
pointer to the transition to which [π] is attached). Consider a transition on A ∈ N that is
generated as the result of a reduction by production A → X_1 X_2 ... X_m ∈ P, m ≥ 1. Suppose
that for 1 ≤ i ≤ m, [π_i] is the parse annotation of the transition on X_i relevant to this
reduction. Then the parse annotation that is attached to this transition on A is
[&π_1, &π_2, ..., &π_m], i.e., an ordered list of pointers to the annotations associated with the
transitions in the path in G_R that spells (X_1 X_2 ... X_m)^R.

In summary, a parse annotation is either (1) a descriptor of a terminal symbol, (2) a
descriptor of the null string, or (3) a sequence of pointers to parse annotations. In order to
reflect the close connection between parse annotations and recognition graph transitions, the
notation used to specify transitions is modified slightly as follows. Currently, (p, X, q)
denotes a transition in δ. In our discussion of General_LR0', this transition will be denoted
by the quadruple (p, X, q, [π]) where [π] is the parse annotation of (p, X, q). Thus, upon
acceptance of the input string, a parse tree for it can be recovered from the grammar
symbols and parse annotations that are associated with the transitions in G_R.
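To make the three kinds of parse annotation concrete, the following sketch shows one possible representation and how a tree might be recovered from it. It is an assumption-laden illustration: the class name Transition, the constant EPSILON, the function tree, and the state names q0 through q3 are invented for this example, and pointers &π are modeled as direct references to the annotated transitions.

```python
from dataclasses import dataclass
from typing import Union

EPSILON = "ε"  # stands in for the descriptor [ε] of the empty string

@dataclass
class Transition:
    source: str
    symbol: str
    target: str
    # Either a terminal descriptor, EPSILON, or an ordered list of
    # references to the transitions whose annotations are pointed to.
    annotation: Union[str, list]

def tree(t):
    """Recover a nested (symbol, subtrees) structure from a transition."""
    if isinstance(t.annotation, list):
        return (t.symbol, [tree(child) for child in t.annotation])
    return (t.symbol, t.annotation)  # leaf: terminal or ε descriptor

# Hypothetical annotated transitions for S -> a A, A -> ε on input "a".
t_a = Transition("q1", "a", "q0", "a")          # installed by Shift: [a]
t_A = Transition("q2", "A", "q1", EPSILON)      # reduction by A -> ε: [ε]
t_S = Transition("q3", "S", "q0", [t_a, t_A])   # [&π_1, &π_2]

parse = tree(t_S)
```

Keeping the pointer list in right-hand-side order means the tree can be read off left to right, even though the corresponding path in the graph spells the right-hand side in reverse.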

Parsing  the  Empty  String 

As identified in Chapter VI, a transition in the recognition graph of the form (p, A, q)
where p, q ∈ Q_i for some i, 0 ≤ i ≤ n, is due to a derivation of the empty string from A.
Such transitions are handled in a particularly simple fashion by the Traverse function of
General_LR0 since the steps in the derivation are not relevant to recognition. However, in
order to fulfill its role as a general parser, General_LR0' must be able to reconstruct a
derivation of ε from A for this transition.

Some derivations of the empty string are especially troublesome, namely those which
are unbounded in length. Unbounded derivations of ε are caused by those nonterminals
A ∈ N for which A ⇒+ A ⇒* ε holds in G. General_LR0' resolves this issue by
disambiguating every ambiguous derivation of ε that occurs during a parse. The Traverse function is
modified to accomplish this task. The details of the revised Traverse function are given in
the next section. In the remainder of this section, we introduce some notions that are used in
that later discussion of Traverse.

First, we define W = {A ∈ N | A ⇒* ε holds in G}. For each nonterminal symbol
A ∈ W, Traverse minimizes the length of derivations of ε from A. Toward that end, a
partition of W is defined as follows: (1) W_1 = {A ∈ W | A → ε ∈ P}, and (2) for i > 1, W_i =
{A ∈ W | A ∉ ∪_{1 ≤ j < i} W_j, A → B_1 B_2 ... B_m ∈ P, m ≥ 1, B_k ∈ W_j, 1 ≤ k ≤ m}. For each
A ∈ W, A ∈ W_i if and only if i is the number of steps in a shortest derivation of ε from A.
Of course, only those subsets W_i for which W_i ≠ ∅ holds are of interest. For each A ∈ W,
define ε-length(A) = i if and only if A ∈ W_i. Thus, ε-length(A) denotes the length of a
shortest derivation of ε from A.

In addition, a unique production is associated with each A ∈ W; this production is
denoted by nuller(A). The intent is for nuller(A) to be used in the first step of any
derivation of ε from A (or rather, the last step in the complementary bottom-up parse of ε). By
making use of nuller(A), ambiguous derivations of ε from A, if they are possible in G, are
disambiguated by Traverse. For each A ∈ W, nuller(A) is defined by the first of the
following two rules which applies.

(1) If A → ε ∈ P, then nuller(A) = A → ε.

(2) Otherwise, nuller(A) = A → B_1 B_2 ... B_m for some A → B_1 B_2 ... B_m ∈ P,
m ≥ 1, such that ε-length(A) = 1 + Σ_{1 ≤ j ≤ m} ε-length(B_j).

For each A ∈ W, there is a derivation of ε from A consisting of ε-length(A) steps in which the
first step is an application of nuller(A). If nuller(A) is determined by rule (2) above, then more than
one production may apply. In this case, an arbitrary choice can be made. Alternatively,
some criterion may be applied toward making this choice more purposeful, e.g., that which
minimizes m or the height of the resulting subparse tree.
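The quantities ε-length and nuller can be precomputed from the grammar by a simple fixed-point iteration. The sketch below is one way to do so and is not taken from this work: the function name nullable_info, the encoding of productions as (left-hand side, right-hand side tuple) pairs, and the example grammar are assumptions. It reads ε-length(A) as the number of derivation steps, in agreement with rule (2), and breaks ties in rule (2) by taking the first qualifying production in the list, which is one of the arbitrary choices permitted above.

```python
import math

def nullable_info(productions):
    """Return (eps_length, nuller) for a list of (lhs, rhs_tuple) productions.

    eps_length maps each nullable nonterminal A to the number of steps in a
    shortest derivation of the empty string from A; nuller maps A to the
    production chosen for the first step of such a derivation.
    """
    eps_length = {}
    changed = True
    while changed:  # relax until no estimate can be lowered
        changed = False
        for lhs, rhs in productions:
            if all(b in eps_length for b in rhs):
                candidate = 1 + sum(eps_length[b] for b in rhs)
                if candidate < eps_length.get(lhs, math.inf):
                    eps_length[lhs] = candidate
                    changed = True

    nuller = {}
    for lhs, rhs in productions:
        if lhs not in eps_length or lhs in nuller:
            continue
        if all(b in eps_length for b in rhs):
            # An empty rhs gives 1 (rule 1); otherwise this is rule (2).
            if 1 + sum(eps_length[b] for b in rhs) == eps_length[lhs]:
                nuller[lhs] = (lhs, rhs)
    return eps_length, nuller

# Hypothetical grammar; () denotes an ε right-hand side, "d" is a terminal.
example = [
    ("S", ("A", "B")),
    ("A", ()),
    ("B", ("A", "A")),
    ("B", ("C",)),
    ("C", ()),
    ("D", ("d",)),
    ("E", ("E",)),
    ("E", ()),
]
eps_length, nuller = nullable_info(example)
```

In this example B is nulled through B → C rather than B → A A, and the cyclic production E → E is never selected, so every chosen derivation of ε is finite.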

Before concluding this section, some motivation for disambiguating all derivations of ε
is provided. Suppose that A ∈ W derives ε in more than one way. If some derivation
of ε from A is a segment of a parse for the input string, then any derivation of ε from A may
be substituted for this segment. In particular, this substitution may be made independently
of the context in which the segment occurs in the complete parse. If one derivation of ε from
A is preferred in a given context, either the grammar must be modified to account for this or
else the favored derivation must be specified by some context-sensitive means. Since
context-sensitive extensions to context-free grammars are beyond the scope of this work, we
choose to disambiguate all parses of ε so as to minimize derivation lengths.

The General_LR0' Parser

The General_LR0' parser is described next. For reference, the parser is rendered in
pseudocode in Figure 7.1 (spanning three pages). The discussion focuses on the modifications
made to the recognizer in deriving the parser. For the most part, the changes are rather
minor. However, the Traverse function underwent substantial revision in order to correctly
handle arbitrary derivations of the empty string.


 1.  function General_LR0'(G = (V, T, P, S); w ∈ T*)
 2.      // w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $
 3.      // Let M_C(G) = (I, V, goto, I_0, I) be the LR(0) automaton for G.
 4.      // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
 5.      Q, δ := {q_{0:0}}, ∅                        // Initialize G_R.
 6.      // Let M_R = (G_R, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
 7.      for i := 0 to n do
 8.          // Let M_R = (G_R, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
 9.          Reduce(i)
10.          // Let M_R = (G_R, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.          Shift(i)
12.          // Let M_R = (G_R, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.          if Q_{i+1} = ∅ then Reject(w) fi
14.      od
15.      // Let M_R = (G_R, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.      Accept(w)
17.  end

18.  function Shift(i)
19.      Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.      while Q_subset ≠ ∅ do
21.          q := Remove(Q_subset)                   // Let goto(ψ(q), a_{i+1}) = I_j.
22.          if q_{j:i+1} ∉ Q then
23.              Q := Q ∪ {q_{j:i+1}}
24.          fi
25.          δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q, [a_{i+1}])}     // Never redundant.
26.      od
27.  end

Figure 7.1. The General_LR0' Parser


(Line 1) The main function of the parser is named General_LR0'. In all other respects,
this function is identical to the main function of the recognizer.

(25) The transitions installed by Shift are assigned appropriate parse annotations. As
described earlier, the parse annotation for a_{i+1} ∈ T is denoted by [a_{i+1}].

(34) This line reflects the new form taken by the transitions of the recognition graph.


28.  function Reduce(i)
29.      δ_subset := δ_i
30.      Traverse(Q_i, i)
31.      Succ_Stack := ∅
32.      while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
33.          if Succ_Stack = ∅ then
34.              (p, X, q, [π]) := Remove(δ_subset)
35.              for A → αX·β ∈ ψ(p) such that β ⇒* ε do
36.                  // Let [π_β] be the parse annotation for β.
37.                  Push(Succ_Stack, (q, A, len(α), [&π, π_β]))
38.              od
39.          else  // Succ_Stack ≠ ∅
40.              (r, A, d, [π_1]) := Pop(Succ_Stack)
41.              if d > 0 then  // Let X = entry(ψ(r)).
42.                  for r' ∈ Q such that (r, X, r', [π_2]) ∈ δ do
43.                      Push(Succ_Stack, (r', A, d−1, [&π_2, π_1]))
44.                  od
45.              else  // d = 0, let goto(ψ(r), A) = I_j.
46.                  if q_{j:i} ∉ Q then
47.                      Q := Q ∪ {q_{j:i}}
48.                      Traverse({q_{j:i}}, i)
49.                  fi
50.                  if (q_{j:i}, A, r, [π]) ∉ δ for any [π] then
51.                      δ := δ ∪ {(q_{j:i}, A, r, [π_1])}
52.                      δ_subset := δ_subset ∪ {(q_{j:i}, A, r, [π_1])}
53.                  else  // Let (q_{j:i}, A, r, [π_2]) ∈ δ hold for some [π_2].
54.                      Disambiguate((q_{j:i}, A, r, [π_1]), (q_{j:i}, A, r, [π_2]))
55.                  fi
56.              fi
57.          fi
58.      od
59.  end

Figure 7.1 (continued)


(35-38) As in the recognizer, we need to initiate all relevant reductions from p by
pushing appropriate entries onto Succ_Stack. However, the computation of the succ function
that is carried out here must also construct parse annotations for the transitions installed by
Reduce. A fourth field is added to each entry in Succ_Stack for this purpose. In short, this
field is used for storing the parse annotation corresponding to the path traversed so far in the
course of making a reduction. Consider the reduction from p on the production A → αXβ
where β ⇒* ε holds in G. The parse annotation of every transition on A that results from
this reduction will include a pointer to the parse annotation of the transition on X from p to
q, namely &π. In addition, it must include the parse annotation relevant to the nullable
suffix β, referred to here as [π_β]. One of the tasks of Traverse is to compute [π_β] and
associate it with the item A → αX·β of ψ(p); in particular, Traverse will have done this by the
time this reduction is made. Thus, the parse annotation [&π, π_β] is the fourth field of the
entry pushed onto Succ_Stack that corresponds to this reduction.

60.  function Traverse(Q_subset, i)
61.      Q_subset' := Q_subset
62.      while Q_subset ≠ ∅ do
63.          q := Remove(Q_subset)
64.          for goto(ψ(q), A) = I_j such that A ⇒* ε do       // A ∈ N
65.              if q_{j:i} ∉ Q then
66.                  Q := Q ∪ {q_{j:i}}
67.                  Q_subset := Q_subset ∪ {q_{j:i}}
68.                  Q_subset' := Q_subset' ∪ {q_{j:i}}
69.              fi
70.              Insert(δ_sorted_list, (q_{j:i}, A, q))        // Never redundant.
71.          od
72.      od
73.      while δ_sorted_list ≠ ∅ do
74.          (p, A, q) := Remove_head(δ_sorted_list)
75.          if nuller(A) = A → ε then
76.              δ := δ ∪ {(p, A, q, [ε])}
77.          else  // Let nuller(A) = A → B_1 B_2 ... B_m, m ≥ 1.
78.              // ∃ a path (q_m, q_{m-1}, ..., q_1, q) in G_R spelling B_m B_{m-1} ... B_1,
79.              // i.e., (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some π_j.
80.              δ := δ ∪ {(p, A, q, [&π_1, &π_2, ..., &π_m])}
81.          fi
82.      od
83.      while Q_subset' ≠ ∅ do
84.          q := Remove(Q_subset')
85.          for A → αX·β ∈ ψ(q) such that β ⇒* ε do
86.              if β = ε then
87.                  Let the parse annotation for β be [].
88.              else  // Let β = B_1 B_2 ... B_m, m ≥ 1.
89.                  // ∃ a path (q_m, ..., q_1, q) in G_R spelling B_m B_{m-1} ... B_1, i.e.,
90.                  // (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some π_j.
91.                  Let the parse annotation for β be [&π_1, &π_2, ..., &π_m].
92.              fi
93.          od
94.      od
95.  end

Figure 7.1 (continued)

(40) This line reflects the new form of the Succ_Stack entries. At this point π_1
represents a nonempty sequence of pointers to parse annotations. These parse annotations
correspond to the suffix of some production right-hand side that is being reduced to A.

(42-44) This loop demonstrates how parse annotations are built up during the course of
computing the succ function. For every transition (r, X, r', [π_2]) that is traversed within this
loop, a pointer to [π_2], together with the parse annotation built up so far, [π_1], becomes part
of the parse annotation for the transition on A that is eventually installed in the recognition
graph. Thus, [&π_2, π_1] is the fourth field of the appropriate entry pushed on the Succ_Stack.

(50-55) If (q_{j:i}, A, r, [π]) ∉ δ for any parse annotation [π], we proceed as before. The
transition (q_{j:i}, A, r, [π_1]) is installed in G_R and added to δ_subset to allow for subsequent
reductions back through it. Note that at this point π_1 represents a nonempty sequence of
pointers to parse annotations corresponding to the right-hand side of some production that
has been reduced to A; more specifically, the sequence of pointers corresponds to a path in
G_R that spells that right-hand side. On the other hand, if (q_{j:i}, A, r, [π_2]) ∈ δ for some parse
annotation [π_2], then an ambiguity has been detected. The Disambiguate function, the
details of which are not specified here, is invoked to decide which parse annotation out of
[π_1] and [π_2] to retain with the transition.
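
The install-or-disambiguate step of lines 50-55 can be sketched as follows. This is only a minimal illustration under our own assumptions, not the code of Figure 7.1: δ is modelled as a Python dictionary keyed by (source, symbol, target) triples, parse annotations are plain lists, and the names install and disambiguate are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

Transition = Tuple[str, str, str]   # (source state, symbol, target state)
Annotation = List[object]           # sequence of pointers to other annotations


def install(delta: Dict[Transition, Annotation],
            delta_subset: List[Transition],
            key: Transition,
            new_annotation: Annotation,
            disambiguate: Callable[[Annotation, Annotation], Annotation]) -> bool:
    """Install an annotated transition; return True if the transition was new."""
    if key not in delta:
        # No annotated copy of this transition exists yet: proceed as before.
        delta[key] = new_annotation
        delta_subset.append(key)    # allow later reductions back through it
        return True
    # The same transition was reached along a second path: an ambiguity.
    # Exactly one of the two annotations is retained.
    delta[key] = disambiguate(new_annotation, delta[key])
    return False
```

Any two-argument function that returns one of its arguments can play the role of Disambiguate here; a policy that always keeps the older annotation is the simplest choice and runs in constant time.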

It is apparent from Figure 7.1 that the Traverse function is substantially more extensive
than before. It now consists of three while loops. Each loop is discussed in turn.

The first while loop is very similar to the single while loop contained in the version of
Traverse used by the General_LR0 recognizer. Two new lines have been added and one line
has been modified.

(61,68) The set variable Q_subset' is initialized to the contents of Q_subset in line 61.
In line 68, each new state that is added to Q within the first while loop is also added to
Q_subset'. The states contained in Q_subset' after the first loop completes are processed
later in the third while loop.

(70) The transitions on nullable nonterminals are not directly added to δ as before.
Instead, they are entered into a list called δ_sorted_list. The elements of the form (p, A, q)
in δ_sorted_list are sorted in order of increasing ε-length(A). The contents of δ_sorted_list
are processed by the second while loop.
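
The ordering used for δ_sorted_list can be illustrated with a small sketch. We assume here, purely for illustration, that ε-length(A) is the height of a shortest derivation tree for A ⇒* ε and that nuller(A) is the production at the root of such a tree; the definitions given earlier in this work take precedence if they differ. The function names below are hypothetical.

```python
from typing import Dict, List, Tuple

Production = Tuple[str, Tuple[str, ...]]   # (left-hand side, right-hand side)


def epsilon_lengths(productions: List[Production]):
    """Return (eps_length, nuller) for every nullable nonterminal.

    Assumed definition: eps_length[A] is the height of a shortest derivation
    tree for A =>* epsilon; nuller[A] is the production at its root.
    """
    eps_length: Dict[str, int] = {}
    nuller: Dict[str, Production] = {}
    changed = True
    while changed:                          # iterate to a fixed point
        changed = False
        for lhs, rhs in productions:
            if any(sym not in eps_length for sym in rhs):
                continue                    # some symbol is not (yet) known nullable
            height = 1 + max((eps_length[sym] for sym in rhs), default=0)
            if lhs not in eps_length or height < eps_length[lhs]:
                eps_length[lhs] = height
                nuller[lhs] = (lhs, rhs)
                changed = True
    return eps_length, nuller


def sort_nullable_transitions(entries, eps_length):
    """Order (p, A, q) triples by increasing eps_length of A."""
    return sorted(entries, key=lambda entry: eps_length[entry[1]])
```

With the productions A → BC, A → a, C → B, and B → ε, this sketch assigns ε-lengths 1, 2, and 3 to B, C, and A, so a transition on A is processed only after the transitions on B and C that its annotation will point to.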

Within the second while loop, an appropriate parse annotation is determined for each
element in δ_sorted_list and the annotated transitions are installed into the recognition
graph. The parse annotation assigned to (p, A, q) is determined by nuller(A).

(73) Each element in δ_sorted_list is considered in turn. No additional elements are
added to δ_sorted_list within this loop.

(74) The element (p, A, q) at the head of δ_sorted_list is removed. At this point, we
know that ε-length(A) ≥ ε-length(A') for each element (p', A', q') removed from δ_sorted_list
in an earlier iteration of the loop.

(75-76) Suppose that nuller(A) = A → ε. Then [ε] is the appropriate parse annotation
for (p, A, q). Thus, the transition (p, A, q, [ε]) is added to δ.

(77-80) Otherwise, nuller(A) = A → B_1 B_2 ⋯ B_m for some production
A → B_1 B_2 ⋯ B_m where m ≥ 1. This implies that ε-length(B_i) < ε-length(A) holds for each
B_i. Since δ_sorted_list was sorted in order of increasing ε-length, an annotated transition on
each B_j has already been installed in G_R. In particular, there must be a path
(q_m, q_{m-1}, ..., q_1, q) in G_R which spells B_m B_{m-1} ⋯ B_1. The transitions in this path are of
the form (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some parse annotations [π_j]. In
this case, (p, A, q, [&π_1, &π_2, ..., &π_m]) is the appropriate transition to add to δ.
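
The second while loop can be sketched along the following lines. The sketch rests on several assumptions of ours: δ is a dictionary from (p, A, q) triples to annotations, a pointer &π_j is modelled by a reference to the annotation list itself, the string "epsilon" stands for ε, and the helper names find_path and annotate_nullable are hypothetical. The same path-following step would serve the third loop, which annotates nullable suffixes of items.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Annotation = List[object]
Transition = Tuple[str, str, str]                 # (p, A, q): from p on A to q
Production = Tuple[str, Tuple[str, ...]]


def find_path(delta: Dict[Transition, Annotation], q: str,
              symbols: Sequence[str]) -> Optional[List[Annotation]]:
    """Collect [pi_1, ..., pi_m] along a path that ends at q.

    symbols is B_1 ... B_m; the path uses the transitions
    (q_1, B_1, q), (q_2, B_2, q_1), and so on. Returns None if no such
    path has been installed in delta.
    """
    if not symbols:
        return []
    first, rest = symbols[0], symbols[1:]
    for (src, sym, dst), annotation in delta.items():
        if sym == first and dst == q:
            tail = find_path(delta, src, rest)
            if tail is not None:
                return [annotation] + tail
    return None


def annotate_nullable(delta: Dict[Transition, Annotation],
                      sorted_list: List[Transition],
                      nuller: Dict[str, Production]) -> None:
    """Process (p, A, q) triples given in increasing eps-length(A) order."""
    for p, a, q in sorted_list:
        _, rhs = nuller[a]
        if not rhs:                                # nuller(A) = A -> epsilon
            delta[(p, a, q)] = ["epsilon"]
        else:                                      # nuller(A) = A -> B_1 ... B_m
            path = find_path(delta, q, list(rhs))
            if path is None:
                raise ValueError("constituent transitions must be installed first")
            delta[(p, a, q)] = path                # references stand in for &pi_j
```

The sorted order is what makes the lookup safe: when a transition on A is processed, every transition on a B_j of nuller(A) has a smaller ε-length and has therefore already been installed.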

The third while loop processes the states contained in Q_subset'. In particular, for
each state p in Q_subset' and each item of the form A → αX·β ∈ I(p) such that β ⇒* ε holds
in G, this loop determines an appropriate parse annotation to associate with the nullable
suffix β. Thus, p is readied for any reductions that are initiated from it in the for loop at
line 35 of the Reduce function. Note that at this point none of the states in Q_subset' have
had reductions made from them yet.

(83)  Each  state  in  Q_subset'  is  considered  in  turn.  No  new  states  are  added  to 
Q_subset'  within  the  loop. 

(84)  A state  q is  removed  from  Q_subset'. 

(85) For each A → αX·β ∈ I(q) such that β ⇒* ε holds in G, we want to associate a
parse annotation with the nullable suffix β. This becomes the parse annotation [π_β] that is
referred to in lines 36-37 of the Reduce function.

(86-87) If β = ε, then the appropriate parse annotation to associate with β is [].

(88-91) Otherwise, β = B_1 B_2 ⋯ B_m for some B_j ∈ N and m ≥ 1. Due to the processing
done in the second while loop, there is a path (q_m, q_{m-1}, ..., q_1, q) in G_R which spells
B_m B_{m-1} ⋯ B_1. Let (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some parse
annotations π_j, be the transitions in that path. Then the appropriate parse annotation to associate
with β in this case is [&π_1, &π_2, ..., &π_m].

The  Complexity  of  Parsing 

Worst-case complexity bounds for the General_LR0' parser are easily derived from the
complexity bounds of the recognizer. In the following, we assume that General_LR0' is
applied to G and w and that w ∈ L(G) holds. Space bounds are examined first.

The size of a parse annotation is bounded by some constant, e.g., the length p of the
longest production right-hand side. If, as assumed, ambiguities are resolved when they are
first detected, only one parse annotation is ever attached to a given transition in G_R. Thus,
the space complexity of the parser is the same as the space complexity of the recognizer.
That is, the space complexity of General_LR0' is O(n^2) if G is arbitrary, or unambiguous
but otherwise arbitrary, and it is O(n) if G is LR(k) and k-symbol lookahead is employed.
The δ_sorted_list that is used by the parser’s version of Traverse contains at most m^2
entries at any time. Thus, its use does not affect the space complexity of parsing.

At the other extreme, the resolution of all ambiguities discovered by Reduce is delayed
until after the input string is accepted. Under this scenario, one parse annotation is attached
to a nonterminal transition for each path in G_R that reduces to that transition. In this case,
the space complexity of the parser is the same as the time complexity of the recognizer, i.e.,
O(n^(p+1)).

Next, the time complexity of the parser is considered. The most substantial differences
between the parser and the recognizer lie with the manufacture of parse annotations and the
Traverse function. The amount of work done within each invocation of Traverse is bounded
by constant factors that are related to the size of M_C(G). Since Traverse is called at most
m times within any invocation of Reduce, the more complicated Traverse function used by
General_LR0' does not increase the time complexity of parsing with respect to recognition.
Moreover, the operations related to constructing parse annotations can clearly be done in a
constant amount of time. Therefore, the worst-case time complexity of the parser is O(n^(p+1))
if G is arbitrary and O(n^2) if G is unambiguous. In addition, LR(k) grammars can be
parsed in linear time provided that k-symbol lookahead is used.

Since the Disambiguate function has not been specified, its impact on the time complexity
of parsing cannot be assessed. In that respect, the above analyses implicitly assume that
the Disambiguate function runs in constant time. However, if more costly mechanisms are
required for resolving ambiguity, the time consumed by them must be accounted for.

Garbage  Collection  Revisited 

Lookahead can be employed within General_LR0' exactly as in General_LR0. However,
the garbage collection procedure proposed for General_LR0 is too simplistic for the
parser. The underlying reason for this lies with the manner in which the parse forest is
superimposed on the recognition graph.

Consider a point during the parse of an input string at which we would like to perform
garbage collection. If the garbage collection procedure proposed for General_LR0 is applied,
the recognition graph may be contracted more than is desired for parsing. Specifically,
transitions may be deleted from G_R whose parse annotations are part of the parse forest relevant
to the prefix of the input string analyzed to that point. The marking phase of the garbage
collection procedure must be modified accordingly to correct for this.

Consider the recognition graph just prior to performing garbage collection. Informally,
we will refer to the states in G_R that are not deleted by our original garbage collection
procedure as being essential to recognition. The states in G_R that are essential to parsing are
defined inductively as follows.

(1) If p ∈ Q is essential to recognition, then p is essential to parsing.

(2) If p ∈ Q is essential to parsing and entry(p) = A for some A ∈ N, then for every
transition (p, A, q, [π]) ∈ δ where [π] = [&π_1, &π_2, ..., &π_m], m ≥ 1, let
(r, X, s, [π_m]) ∈ δ be the rightmost transition referenced in [π]. Then r and all
states reachable from r are essential to parsing.

The  marking  phase  of  the  garbage  collection  procedure  must  be  modified  so  as  to  mark 
all  states  in  GR  that  are  essential  to  parsing.  In  order  to  accomplish  this,  certain  branches  of 
the  parse  forest  must  be  traversed  according  to  the  inductive  definition  given  above.  The 
second  step  of  the  garbage  collection  procedure,  that  which  deletes  unmarked  states  and 
their  out-going  transitions,  remains  unchanged. 
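
The modified marking phase can be sketched as a worklist computation over the inductive definition above. The data model is an assumption made for illustration: δ is a dictionary from (p, A, q) triples to annotations, and each annotation lists the transitions whose annotations it points to (so its last element is the rightmost referenced transition). The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set, Tuple

Transition = Tuple[str, str, str]      # (p, A, q): from p on A to q
# Assumption: an annotation lists the transitions whose annotations it points to.
Annotation = List[Transition]


def reachable_from(start: str, successors: Dict[str, Set[str]]) -> Set[str]:
    """Return start together with all states reachable from it."""
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        for nxt in successors.get(state, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def mark_essential_to_parsing(delta: Dict[Transition, Annotation],
                              entry: Dict[str, str],
                              nonterminals: Set[str],
                              essential_to_recognition: Iterable[str]) -> Set[str]:
    successors: Dict[str, Set[str]] = {}
    for src, _sym, dst in delta:
        successors.setdefault(src, set()).add(dst)

    marked = set(essential_to_recognition)           # rule (1)
    worklist = list(marked)
    while worklist:
        p = worklist.pop()
        if entry.get(p) not in nonterminals:
            continue
        for (src, sym, _dst), annotation in delta.items():
            if src != p or sym != entry[p] or not annotation:
                continue
            r = annotation[-1][0]                    # source of the rightmost referenced transition
            for state in reachable_from(r, successors):   # rule (2)
                if state not in marked:
                    marked.add(state)
                    worklist.append(state)
    return marked
```

States that are never reached by either rule remain unmarked and are deleted, together with their out-going transitions, by the unchanged second step.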

Discussion 

The General_LR0 recognizer was extended into a general context-free parser. The
parse forest constructed by General_LR0' is represented by attaching appropriate parse
annotations to the transitions of G_R. In effect, the parse forest is superimposed on the
recognition graph.

Only minor modifications were required of the Shift and Reduce functions in order to
accommodate parsing. The Traverse function, on the other hand, was changed substantially.
It is important to note that Traverse can handle the most ill-formed grammars. For example,
consider the grammar with the production set P = {S_0 → ε, S_0 → a, S_0 → S_1, S_1 → S_2,
..., S_{k-1} → S_0} for some k > 1. This grammar was submitted by Graham et al. [20, p.
429] as an example of a particularly bad worst-case. Although this is a contrived example,
the ability to effectively deal with pathological conditions if and when they arise is valuable
from both a theoretical and practical standpoint. Toward that end, the Traverse function
handles the worst situations in a fairly straightforward manner. Nevertheless, Traverse can
be tailored to meet the specific requirements of the subject grammar if the generality it
provides is not needed.

A parse annotation for a nonterminal transition is manufactured as a sequence of
pointers to the parse annotations that are encountered while a path is traversed during a
reduction. Tomita’s algorithm performs similar operations to construct a parse forest. In his
parsing algorithm, the symbol vertices of the recognizer are used for storing pointers to the
nodes of the parse forest. Of course, the complexity introduced into the recognizer by the
symbol vertices and the ad hoc manner in which ε-productions are handled carry over to the
parser.

In Tomita’s algorithm, the parse forest is built separately from the graph-structured
stack. General_LR0' constructs the parse forest more or less on top of the recognition graph,
but could just as easily build the parse forest separately as well. The choice that is made for
an actual implementation primarily has implications for garbage collection.

The worst-case time complexity of General_LR0' matches that of General_LR0. With
respect to General_LR0', the expression n^(p+1) reflects the time required, in the worst case, to
construct a direct representation of the parse forest. Thus, the relative inefficiency of
General_LR0 as compared to Earley’s recognizer is offset by the benefits accrued by
General_LR0'. Specifically, the traversals that are required to produce a parse and to resolve
ambiguities are made convenient by the structure of the parse forest. In contrast, Earley’s
parser produces a rather indirect representation of the parse forest. Little is said in the
literature of how this affects the ease with which a parse is produced or with which ambiguities
are resolved by Earley’s parser.

The hypothetical Disambiguate function referred to in Figure 7.1 allowed us to keep
the specification of transitions simple. By assumption, Disambiguate resolved ambiguities at
the point where they were first detected, so only one parse annotation was ever attached to a
given transition in G_R. Of course, this assumption is unrealistic in the general case. A
substantive treatment of ambiguity and its resolution is well beyond the scope of this work.
However, the following very basic observations may be made.

The task confronted by Disambiguate in line 54 of Figure 7.1 is to determine which
transition out of (q_{j:i}, A, r, [π_1]) and (q_{j:i}, A, r, [π_2]) to retain in G_R. In order of increasing
complexity, a selection may be made based on the following strategies.

(1) Through a direct comparison of [π_1] and [π_2].

(2) A combination of (1) and an analysis of the subparse trees referred to by [π_1] and
[π_2], respectively.

(3)  An  analysis  of  the  surrounding  context  in  combination  with  (1)  and  (2). 

The Disambiguate function could conceivably resolve ambiguities that entailed analyses of
type (1) or (2) above. On the other hand, ambiguities requiring type (3) analysis would have
to be postponed until later in the parse if they depended on right context. Some simple
approaches to handling ambiguity are described by Aho et al. [2], Earley [15], Tarhio [41],
and Wharton [45].


CHAPTER  VIII 
CONCLUSION 


Summary  of  Main  Results 

The  first  part  of  this  work  presented  a framework  for  describing  general  canonical 
context-free  recognition.  The  framework  has  a structurally  simple  mathematical  foundation. 
The  essence  of  general  canonical  recognition  was  captured  using  a small  number  of  binary 
relations  and  basic  set-theoretic  concepts.  Each  general  recognition  scheme  that  was 
presented  followed  the  same  script  while  exploiting  inherent  properties  of  viable  prefixes. 
Specifically,  general  recognition  was  reduced  to  computing  a sequence  of  regular  sets  in  each 
case.  Regularity-preserving  relations  were  applied  to  effect  the  set-to-set  mappings.  Our 
characterization  of  general  recognition  is  novel  and  rather  elegant.  Its  clarity  and  simplicity 
confirm  that  viable  prefixes  are  especially  suitable  bases  for  general  recognition.  Moreover, 
our  framework  offers  a conceptual  breakthrough  toward  a better  understanding  of  the 
quintessence  of  general  canonical  recognition. 

Earley’s  algorithm  proved  an  especially  fitting  vehicle  for  demonstrating  the  efficacy  of 
the  General_LR  and  General_LL  recognition  schemes.  In  particular,  our  graphical  variant  of 
Earley’s  recognizer,  Earley',  illustrated  one  way  of  realizing  explicit  representations  for  the 
sets  of  viable  prefixes  and  viable  suffixes  that  are  tracked  by  these  two  complementary 
schemes.  The  fact  that  General_LR  is  directly  manifested  by  Earley'  led  us  to  conclude  that 
it  is  more  appropriate  to  interpret  Earley’s  algorithm  as  a bottom-up  method  rather  than  a 
top-down  one.  Regardless  of  which  interpretation  one  favors,  Earley'  provided  much  new 
insight  into  Earley’s  algorithm.  Specifically,  a deeper  understanding  of  Earley’s  algorithm 
was gained and its relationship with LR parsers was clarified.

The last two chapters were devoted to describing practical recognizers and parsers that
are derived from the General_LR recognition scheme. Automata-based versions of
General_LR are obtained by using an automaton that accepts VP(G) to guide the construction
of a state-transition graph, the recognition graph. The recognition graph explicitly
represents the sets of viable prefixes that are computed by General_LR. In the discussion of
the algorithms, LR(0) and NLR(0) automata were used as control automata. However, other
choices are possible, such as automata that are intermediate between the LR(0) and NLR(0)
automata as well as automata that are attributed with lookahead. The General_LR0 parser
can process arbitrary reduced context-free grammars. To accommodate especially ill-designed
grammars, simple means for dealing with pathological grammar properties were
presented. Finally, the parse forest representation used by the General_LR0 parser is easy
to understand and convenient for handling ambiguity.

We have included some discussion of how the Earley and Tomita algorithms compare
to ours. Although the O(n^(p+1)) worst-case time complexity of the General_LR0 recognizer
does not compare favorably with the O(n^3) worst-case complexity of Earley’s recognizer, it
is expected that General_LR0 would outperform Earley’s algorithm in most practical
situations. Moreover, it is more convenient to work with the representation of the parse forest
that is used in our framework. The General_LR0 algorithm is in the same complexity class
as Tomita’s algorithm. This is not a surprising result given the similarities between the two.
However, our algorithm can parse any reduced grammar. Thus, we have generalized
Tomita’s algorithm; ironically, our general algorithm is also simpler than Tomita’s. Lastly,
our framework provides some firm theoretical justification for Tomita-like parsers. Tomita’s
algorithm is notably lacking in that respect in that it is more of an ad hoc generalization of
the standard LR parsing algorithm.

The general parsers derived in our framework, viz., the General_LR0 parser and its
variants, are appropriate to areas of application which require more flexible parsers than are
provided within the confines of LR parsing theory. In a more general sense, our work
provides a basis from which many issues relating to context-free recognition and parsing may be
further investigated. Most notably, our viable prefix-based model of recognition and parsing
offers a particularly appropriate framework within which a broad spectrum of related parsing
strategies (LR parsers, the Earley and Tomita algorithms, and our general parsers) may
be further studied and compared.

Directions  for  Future  Research 

Before  concluding,  we  suggest  some  possible  directions  for  further  research.  There  are 
several  worthwhile  prospects.  Of  course,  it  is  assumed  that  the  framework  laid  down  herein 
would  be  used  as  a starting  point  for  the  endeavors  described  below. 

Several automata-based versions of the General_LR recognition scheme were considered.
Specifically, concrete realizations of General_LR were borne out by the Earley',
General_LR0, and General_NLR0 recognizers. The other left-to-right recognition scheme,
General_LL, was mimicked by Earley' in a rather obscure fashion. The automata-theoretic
aspects of General_LL should be investigated to determine more direct means for tracking
the sets of viable suffixes that are computed by it. Our preliminary findings along this line
indicate that an automata-based General_LL recognizer that runs in O(n^3) time in the worst
case is indeed attainable. That is, the time complexity does not depend on the length of
production right-hand sides as is the case with General_LR0. However, we were unable to
extend this general viable suffix-based recognizer into a parser, so further study of this issue
was suspended.

It is expected that a pursuit of the following three topics would benefit from
experimenting with actual implementations.

(1) Ascertain a more precise characterization of the O(n^2) time and O(n) time grammar
classes. It is well-known that Earley’s algorithm recognizes grammars with
bounded ambiguity in quadratic time; moreover, even some ambiguous grammars
are recognized in linear time.

(2) Consider alternate control automata for implementing the General_LR recognition
scheme (including automata that are attributed with lookahead). We have
already suggested employing automata that are intermediate between NLR(0)
automata and LR(0) automata.

(3)  Identify  means  for  classifying  ambiguity  and  investigate  disambiguation  strategies. 

As described, the General_LR0 parser produces a parse of the input string only after
the string is accepted, i.e., like Earley’s algorithm. It would be advantageous to be able to
obtain parse fragments as soon as they are known to be part of a final parse. The parser
would then behave more like an extended LR parser. The General_LR0 parser should be
modified to provide for such a piecemeal delivery of a parse. Note that such a mechanism
would have implications for garbage collection.

The O(n^(p+1)) worst-case time complexity of General_LR0 compares unfavorably with
Earley’s algorithm. The last topic that we suggest addresses this. A grammar is in canonical
two-form if its productions are of the forms A → BC, A → B, A → a, and A → ε [39].
Clearly, every canonical two-form grammar can be recognized in O(n^3) time. One possible
approach to recognizing an arbitrary grammar in O(n^3) time is to transform it into an
equivalent canonical two-form grammar and recognize the input string with respect to the
new grammar. A parse in the original grammar could then be reconstructed from the parse
that is obtained in the transformed canonical two-form grammar.
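
One way such a transformation might be carried out is sketched below. It is an ordinary binarization and is not claimed to be the construction of [39]: terminals inside long right-hand sides are lifted to fresh nonterminals, and each long right-hand side is split into a chain of two-symbol productions. The bracketed names such as <a> and <0.1> are hypothetical fresh symbols and are assumed not to clash with symbols of the original grammar.

```python
from typing import List, Set, Tuple

Production = Tuple[str, Tuple[str, ...]]   # (left-hand side, right-hand side)


def to_canonical_two_form(productions: List[Production],
                          terminals: Set[str]) -> List[Production]:
    """Rewrite productions into the forms A -> B C, A -> B, A -> a, A -> epsilon."""
    result: List[Production] = []
    lifted: Set[str] = set()               # terminals that already have <a> -> a
    for index, (lhs, rhs) in enumerate(productions):
        if len(rhs) <= 1:                  # already of an allowed form
            result.append((lhs, rhs))
            continue
        symbols = []
        for sym in rhs:
            if sym in terminals:           # lift terminal a to nonterminal <a>
                fresh = f"<{sym}>"
                if sym not in lifted:
                    lifted.add(sym)
                    result.append((fresh, (sym,)))
                symbols.append(fresh)
            else:
                symbols.append(sym)
        head = lhs
        for k in range(len(symbols) - 2):  # chain of fresh nonterminals
            fresh = f"<{index}.{k + 1}>"
            result.append((head, (symbols[k], fresh)))
            head = fresh
        result.append((head, (symbols[-2], symbols[-1])))
    return result
```

Because each chain nonterminal <i.k> records the index i of the production it came from, a parse in the original grammar can be recovered by splicing out the nodes labelled with bracketed names.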